Objectives (The same as in the original proposal)

In  a  large  number  of  cases,  we  have  to  assess  the  risk  of  chemicals  and  predict  the  toxic 
potential  of  molecules  in  the  face  of  limited  experimental  data.  Structural  criteria  and  functional 
criteria  (if  available)  are  routinely  used  to  estimate  the  possible  hazard  posed  by  a  chemical  to 
the  environment  and  ecosystem.  Frequently,  no  biological  or  relevant  physicochemical 
properties  of  the  chemical  species  of  interest  are  available  to  the  risk  assessor. 

In  the  proposed  project,  we  will  develop  and  implement  a  number  of  methods  of  quantifying 
molecular  similarity  of  chemicals  using  techniques  of  computational  and  mathematical 
chemistry.  Some  of  the  methods  are  new  and  will  be  based  on  our  own  research  on  the 
theoretical  development  and  implementation  of  molecular  similarity  methods.  These  techniques 
will  be  implemented  in  a  user  friendly  computer  environment  of  the  Silicon  Graphics 
workstation.  The  similarity  methods  will  be  used  to  select  analogs  of  chemicals  of  interest  to  the 
Air  Force,  viz.,  QUADRICYCLANE,  FLUOROCARBON  ETHERS  AND  THEIR  ANALOGS,  from 
databases containing high quality physicochemical data and toxicity endpoints for a large number
of  chemicals.  The  databases  used  in  the  project  will  come  from  three  sources:  a)  public  domain 
databases,  b)  our  own  in-house  databases,  and  c)  databases  acquired  from  commercial 
vendors. 

The  set  of  selected  analogs,  called  probe-induced  subsets,  will  be  used  to:  a)  develop 
structure-activity  relationships  (SAR),  and  b)  carry  out  ranking  of  chemicals.  Both  of  these 
methods  will  be  used  to  estimate  the  hazard  of  the  chemicals  of  interest. 

A  set  of  chemicals  (five  to  ten)  will  be  chosen  for  experimental  work  with  the  purpose  of 
evaluating  and  refining  computer  models.  The  set  will  include  quadricyclane  and  fluorocarbon 
ethers  of  interest  to  the  Air  Force.  It  will  also  include  a  selection  of  analogs  (probe-induced 
subset)  that  are  readily  available,  suitable  for  experimentation,  and  for  which  data  are  lacking. 
Experiments  will  be  performed  to  assess  the  biodegradability  and  photochemical  degradability 
of  the  members  of  the  set.  Their  toxicity  will  be  tested  by  MicroTox  and  MutaTox.  In  cases 
where  significant  degradation  is  observed,  the  toxicity  of  the  degradation  products  will  also  be 
tested.  Direct  measurement  of  the  hydrophobicity  (octanol-water  partition  coefficient)  will  be 
performed  on  the  members  of  the  set. 


Status  of  Effort 

A  number  of  novel  molecular  similarity  methods  have  been  developed  using  topostructural  and 
topochemical  parameters  which  can  be  computed  directly  from  molecular  structure  using 
POLLY.  Topological  indices  (TIs),  atom  pairs  (APs),  geometrical  parameters,  and  semiempirical 
quantum  chemical  parameters  have  been  used  for  molecular  similarity  analysis  and 
development  of  hierarchical  QSAR  models.  The  relative  effectiveness  of  the  various  similarity 
techniques in selecting analogs and estimating properties of toxicological importance has been
tested  on  a  selected  set  of  properties  such  as  mutagenicity,  acute  toxicity,  lipophilicity  (logP, 
octanol/water),  etc.  The  K  nearest  neighbor  (KNN)  method,  K=1,  2, ...  25,  has  been  used  in 
generating  probe-induced  subsets  from  different  databases.  Results  show  that  the  KNN  method 
gives  the  best  estimate  of  properties  at  K  =  5-10  for  the  properties  studied. 
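
The KNN estimation described above can be sketched in a few lines of Python. This is only an illustrative sketch: it assumes Euclidean distance in a descriptor space and an unweighted mean over the K nearest neighbors, and the descriptor vectors and logP values are hypothetical toy numbers, not data from this project.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two equal-length descriptor vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def knn_estimate(probe, database, k):
    """Estimate the probe's property as the mean over its k nearest neighbors.

    database: list of (descriptor_vector, property_value) pairs.
    """
    ranked = sorted(database, key=lambda entry: euclidean(probe, entry[0]))
    neighbors = ranked[:k]  # the probe-induced subset of size k
    return sum(value for _, value in neighbors) / len(neighbors)

# Hypothetical toy database: (descriptor vector, logP).
toy_db = [
    ((0.0, 0.0), 1.0),
    ((1.0, 0.0), 2.0),
    ((0.0, 2.0), 3.0),
    ((5.0, 5.0), 10.0),
]
probe = (0.1, 0.0)
estimates = {k: knn_estimate(probe, toy_db, k) for k in (1, 2, 3)}
```

Repeating the estimate over a range of K, as in the K = 1 to 25 scan reported here, shows how the choice of K trades off closeness of the neighbors against averaging over more of them.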

Seventy-five  probe-induced  subsets  have  been  generated  for  Quadricyclane  from  three 
different  databases:  a)  STARLIST  logP  database  of  Daylight,  Inc.,  containing  more  than  4,000 
high quality logP values, b) the Available Chemicals
Directory (ACD), containing over 180,000 chemicals which are currently available from
suppliers worldwide (for b, the selection was restricted to Cj cmpds), and c) a Chemical Abstracts Service database containing about 120,000
diverse  chemicals.  Some  of  the  selected  analogs  are  being  tested  in  the  laboratory  in  order  to 
determine  the  utility  of  analogs  in  predicting  properties  of  chemicals  from  the  properties  of  their 
neighbors  in  similarity  spaces. 


Accomplishments/New  Findings 

The  following  is  a  summary  of  accomplishments  of  the  various  tasks  of  the  project  during  the 
reporting  period. 

TASK  1:  Development  of  data  bases 

A  large  number  of  databases  relevant  to  toxicology  have  been  developed  from  published 
literature.  These  include  properties  like  teratogenicity,  inhibition  of  microsomal  and 
mitochondrial  oxygen  uptake  in  rat  cerebellum  by  chemicals,  minimum  inhibitory  concentration 
of chemicals for DNA gyrase activity in E. coli, EC50 for AHH receptor activation, Ames
mutagenicity, Ito’s test for carcinogenicity, liver carcinogenicity in rat/mice, acute toxicity of
various pollutants including pesticides, LC50 in guppy, LC50 in fathead minnow, skin permeability
of  chemicals,  lowest  observed  adverse  effect  levels  (LOAELs),  water  solubility,  soil  sorption 
coefficient,  toxicity  of  organophosphate  insecticides,  and  toxicity  of  respiratory  uncouplers. 

Many  of  these  data  have  structural/mechanistic  implications  for  toxicology.  Some  sets  of 
compounds  contain  a  specific  toxicophore  which  is  responsible  for  their  particular  toxic  action. 
QSAR  studies  can  show  how  the  effect  of  the  toxicophore  is  modulated  by  structural 
modifications.  On  the  other  hand,  some  toxicological  data  are  collected  on  common  biological 
endpoints  of  diverse  structural  types.  These  data  will  be  used  to  develop  similarity  and 
hierarchical  QSAR  models.  Mechanistic  data  developed  by  the  toxicology  group  at  the  Air  Force 
labs  will  be  used  to  validate  the  QSAR  models  generated  from  literature  data. 

TASK  2:  Development  of  methods  to  quantify  molecular  similarity 
New  molecular  similarity  methods  have  been  developed  using  topostructural  indices, 
topochemical  parameters,  atom  pairs  (APs)  and  geometrical  parameters.  A  hierarchical 
approach to the quantification of molecular similarity has been developed on a limited scale.

Principal  components  analysis  (PCA)  and  variable  clustering  methods  have  been  used  to  create 
orthogonal structure spaces from POLLY parameters. AP based similarity methods have also
been  compared  with  PCA  based  methods  in  the  selection  of  analogs  and  prediction  of 
properties. 
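
As a rough illustration of how PCA yields an orthogonal structure space from correlated descriptors, the following Python sketch standardizes a small descriptor table, extracts the leading principal components by power iteration with deflation, and projects the compounds onto them. The numerical method and the descriptor values are assumptions chosen for a self-contained example; they are not the POLLY indices or the statistical software used in the project, and the sketch assumes every descriptor has nonzero variance.

```python
import math

def standardize(rows):
    """Scale each descriptor column to zero mean and unit sample variance."""
    n, m = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(m)]
    sds = [math.sqrt(sum((r[j] - means[j]) ** 2 for r in rows) / (n - 1))
           for j in range(m)]
    return [[(r[j] - means[j]) / sds[j] for j in range(m)] for r in rows]

def correlation_matrix(z_rows):
    """Covariance of standardized columns, i.e. the correlation matrix."""
    n, m = len(z_rows), len(z_rows[0])
    return [[sum(r[i] * r[j] for r in z_rows) / (n - 1) for j in range(m)]
            for i in range(m)]

def leading_eigenpair(matrix, iterations=500):
    """Power iteration for the dominant eigenpair of a symmetric matrix."""
    m = len(matrix)
    vec = [1.0 / (i + 1) for i in range(m)]  # arbitrary non-zero start vector
    for _ in range(iterations):
        nxt = [sum(matrix[i][j] * vec[j] for j in range(m)) for i in range(m)]
        norm = math.sqrt(sum(x * x for x in nxt))
        vec = [x / norm for x in nxt]
    value = sum(vec[i] * matrix[i][j] * vec[j]
                for i in range(m) for j in range(m))
    return value, vec

def principal_components(rows, n_components):
    """Return (eigenvalues, scores) for the first n_components PCs."""
    z = standardize(rows)
    work = correlation_matrix(z)
    m = len(work)
    eigenvalues, axes = [], []
    for _ in range(n_components):
        value, vec = leading_eigenpair(work)
        eigenvalues.append(value)
        axes.append(vec)
        # Deflation: remove the component just found before the next search.
        work = [[work[i][j] - value * vec[i] * vec[j] for j in range(m)]
                for i in range(m)]
    scores = [[sum(row[j] * axis[j] for j in range(m)) for axis in axes]
              for row in z]
    return eigenvalues, scores

# Hypothetical descriptors: column 2 is exactly twice column 1 (redundant).
descriptors = [[1, 2, 5], [2, 4, 3], [3, 6, 6], [4, 8, 2], [5, 10, 4]]
eigenvalues, scores = principal_components(descriptors, 2)
pc1 = [s[0] for s in scores]
pc2 = [s[1] for s in scores]
overlap = sum(a * b for a, b in zip(pc1, pc2))  # near zero: PCs are orthogonal
```

Because the two redundant descriptors collapse onto one axis, two components carry all of the variance of the three original columns, and Euclidean distances can then be computed in the reduced, uncorrelated space.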

The following publications reported the results of molecular similarity analysis:

1) Use of graph-theoretic parameters in predicting inhibition of microsomal p-hydroxylation
of aniline by alcohols: a molecular similarity approach, by Subhash C. Basak and Brian
D. Gute, pp. 492-504, In: Proceedings of the 2nd International Congress on Hazardous
Waste: Impacts on Human and Ecological Health, B.L. Johnson, C. Xintaras, J.S.
Andrews, Jr., Eds., Princeton Scientific Publishing Co. Inc., Princeton, New Jersey,
1997.

This  paper  compares  the  relative  effectiveness  of  the  Euclidean  distance  (ED)  and  AP 
methods  in  estimating  the  inhibition  of  microsomal  p-hydroxylation  of  aniline  by  alcohols. 

2) Estimation of the normal boiling points of haloalkanes using molecular similarity, by
Subhash C. Basak, Brian D. Gute, and Gregory D. Grunwald. Croatica Chemica Acta
69:1159-1173, 1996.

This  paper  estimated  the  normal  boiling  points  of  a  set  of  267  haloalkanes  using 
molecular  similarity  methods. 

3)  Use  of  graph-theoretic  and  geometrical  molecular  descriptors  in  structure-activity 
relationships, by Subhash C. Basak, Gregory D. Grunwald, and Gerald J. Niemi, pp. 73-116,
In: From Chemical Topology to Three-Dimensional Geometry, ed. A.T. Balaban,
Plenum  Press,  New  York,  1997. 

This  book  chapter  presents  a  comprehensive  review  of  the  utility  of  topological  indices  in 
QSAR  and  the  quantification  of  intermolecular  similarity. 

4)  Characterization  of  the  molecular  similarity  of  chemicals  using  topological  invariants,  by 
Subhash C. Basak, Brian D. Gute, and Gregory D. Grunwald. In: Advances in Molecular
Similarity, JAI Press, submitted, 1997.

This  paper  analyzed  the  utility  of  topostructural  and  topochemical  indices  in  the 
quantification  of  molecular  similarity  and  selection  of  analogs. 

Copies  of  the  above  mentioned  papers  are  attached  (Vide  Infra,  Publication  Section) 

TASK 3: Selection of analogs

Analogs  or  "probe-induced  subsets"  selected  from  databases  with  good  quality  experimental 
data  can  be  useful  in  predicting  properties  of  probe  chemicals.  Taking  Quadricyclane  as  the 
probe,  we  selected  75  analogs  using  different  search  methods.  The  results  of  such  analyses 
have  been  previously  reported.  Various  molecular  similarity  methods  have  also  been  used  in 
the  selection  of  neighbors  for  KNN  based  property  estimation. 
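
For readers unfamiliar with atom pair (AP) similarity, the following Python sketch shows one common form of the calculation: each molecule is reduced to a multiset of atom pairs (two atom types plus the number of bonds separating them), and similarity is taken as 2C/(Ta + Tb), where C is the number of shared pairs and Ta, Tb are the pair totals. As a simplifying assumption, atoms are typed here by element symbol only, which is coarser than the atom typing used in published AP methods, and the two example molecules are illustrative only.

```python
from collections import Counter, deque

def path_lengths(adjacency, start):
    """Breadth-first search: bond counts from `start` to each reachable atom."""
    dist = {start: 0}
    queue = deque([start])
    while queue:
        atom = queue.popleft()
        for nbr in adjacency[atom]:
            if nbr not in dist:
                dist[nbr] = dist[atom] + 1
                queue.append(nbr)
    return dist

def atom_pairs(elements, bonds):
    """Multiset of ((type_a, type_b), topological distance) over heavy atoms."""
    adjacency = {i: [] for i in range(len(elements))}
    for a, b in bonds:
        adjacency[a].append(b)
        adjacency[b].append(a)
    pairs = Counter()
    for i in range(len(elements)):
        dist = path_lengths(adjacency, i)
        for j in range(i + 1, len(elements)):
            if j in dist:
                types = tuple(sorted((elements[i], elements[j])))
                pairs[(types, dist[j])] += 1
    return pairs

def ap_similarity(pairs_a, pairs_b):
    """Shared-pair similarity 2C / (Ta + Tb); 1.0 for identical pair sets."""
    total = sum(pairs_a.values()) + sum(pairs_b.values())
    if total == 0:
        return 0.0
    shared = sum((pairs_a & pairs_b).values())  # multiset intersection
    return 2.0 * shared / total

# Hydrogen-suppressed graphs: propane (C-C-C) and ethanol (C-C-O).
propane = atom_pairs(["C", "C", "C"], [(0, 1), (1, 2)])
ethanol = atom_pairs(["C", "C", "O"], [(0, 1), (1, 2)])
self_sim = ap_similarity(propane, propane)
cross_sim = ap_similarity(propane, ethanol)
```

Ranking database entries by this score against a probe, and keeping the top K, gives one way to form a probe-induced subset without any descriptor-space distance.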

TASK  4 

A.  Estimation  of  properties  of  the  target  chemical  from  the  probe-induced  subset 
We  studied  the  effectiveness  of  similarity  methods  developed  in  Task  2  above  by  applying 
these  methods  in  estimating  various  physicochemical  and  toxicological  endpoints.  To  this  end, 
we  carried  out  similarity-based  estimation  of  physicochemical  and  toxicological  properties. 

Three  papers  were  submitted  out  of  the  research  carried  out  in  this  task.  These  results  were 
also  presented  in  numerous  national  and  international  symposia  and  invited  presentations. 

a. Use of graph-theoretic parameters in predicting inhibition of microsomal
p-hydroxylation of aniline by alcohols: a molecular similarity approach, by Subhash C.
Basak and Brian D. Gute, pp. 492-504, In: Proceedings of the 2nd International
Congress on Hazardous Waste: Impacts on Human and Ecological Health, B.L.
Johnson, C. Xintaras, J.S. Andrews, Jr., Eds., Princeton Scientific Publishing Co.
Inc., Princeton, New Jersey, 1997.

b. Estimation of the normal boiling points of haloalkanes using molecular similarity, by
Subhash C. Basak, Brian D. Gute, and Gregory D. Grunwald. Croatica Chemica Acta,
69:1159-1173, 1996.

c. Characterization of the molecular similarity of chemicals using topological invariants,
by Subhash C. Basak, Brian D. Gute, and Gregory D. Grunwald, In: Advances in
Molecular Similarity, JAI Press, submitted, 1997.

B. Hierarchical approach to toxicity estimation using topological, geometrical, and quantum
chemical  parameters 

We  have  also  been  developing  a  hierarchical  approach  to  computational  toxicology  using 
topostructural,  topochemical,  geometrical,  and  quantum  chemical  parameters  which  can  be 
calculated  directly  from  molecular  structure.  This  approach  uses  increasingly  more  complex 
parameters  to  estimate  properties  of  chemicals,  as  necessary  for  a  particular  situation.  We 
have  listed  below  the  book  chapters/papers  in  peer-reviewed  journals  which  have  reported 
these  results. 

5)  A  comparative  study  of  topological  and  geometrical  parameters  in  estimating  normal 
boiling  point  and  octanol/water  partition  coefficient.  Subhash  C.  Basak,  Brian  D.  Gute, 
and  Gregory  D.  Grunwald.  J.  Chem.  Inf.  Comput.  Sci.  36:1054-1060, 1996. 

This  paper  used  topostructural,  topochemical  and  geometrical  parameters  in  the 
development  of  hierarchical  QSAR  models  for  predicting  logP  (octanol/water)  and 
boiling  point. 

6)  Use  of  topostructural,  topochemical  and  geometric  parameters  in  the  prediction  of  vapor 
pressure:  a  hierarchical  QSAR  approach,  S.  C.  Basak,  B.  D.  Gute  and  G.  D.  Grunwald, 
J. Chem. Inf. Comput. Sci., 37:651-655, 1997.

This  paper  utilized  a  hierarchical  QSAR  approach  in  estimating  vapor  pressure  of  a 
diverse  set  of  476  chemicals. 

7)  Predicting  acute  toxicity  (LC50)  of  benzene  derivatives  using  theoretical  molecular 
descriptors:  a  hierarchical  QSAR  approach,  B.  D.  Gute  and  S.  C.  Basak,  SAR  QSAR 
Environ.  Res.,  in  press,  1997. 

This paper used a hierarchical QSAR approach in estimating the acute toxicity of a set
of 69 benzene derivatives. Topostructural, topochemical, geometrical and quantum
chemical parameters were used as independent variables.

8)  Characterization  of  molecular  structures  using  topological  indices,  S.C.  Basak  and  B.D. 
Gute;  SAR  QSAR  Environ.  Res.,  in  press,  1997. 

9)  The  relative  effectiveness  of  topological,  geometrical,  and  quantum  chemical 
parameters  in  estimating  mutagenicity  of  chemicals,  S.  C.  Basak,  B.  D.  Gute  and  G.  D. 
Grunwald.  In:  Proceedings  of  the  Seventh  International  Workshop  on  Quantitative 
Structure-Activity Relationships in Environmental Sciences, SETAC Press, in press,
1997. 

This  paper  used  a  hierarchical  approach  in  estimating  mutagenicity  of  chemicals. 

Copies  of  the  above  papers  are  attached  (Vide  Infra,  Publication  Section) 
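
The idea behind the hierarchical approach, adding increasingly complex classes of descriptors only as long as they improve the model, can be sketched as follows. This Python example fits ordinary least-squares models on cumulative descriptor tiers and records R-squared at each tier. Plain least squares is used as a stand-in for whatever regression procedure the cited papers employ, and the descriptor columns and property values are synthetic numbers constructed for illustration.

```python
def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x

def r_squared(columns, y):
    """R^2 of a least-squares fit of y on the given columns plus an intercept."""
    n = len(y)
    design = [[1.0] + [col[i] for col in columns] for i in range(n)]
    p = len(design[0])
    xtx = [[sum(row[a] * row[b] for row in design) for b in range(p)]
           for a in range(p)]
    xty = [sum(row[a] * y[i] for i, row in enumerate(design))
           for a in range(p)]
    beta = solve(xtx, xty)  # normal equations
    pred = [sum(bj * xj for bj, xj in zip(beta, row)) for row in design]
    mean_y = sum(y) / n
    ss_res = sum((yi - pi) ** 2 for yi, pi in zip(y, pred))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    return 1.0 - ss_res / ss_tot

# Synthetic descriptor tiers, ordered from simplest to most complex.
tiers = [
    ("topostructural", [[1, 2, 3, 4, 5, 6]]),
    ("topochemical", [[0, 1, 0, 1, 0, 1]]),
    ("geometrical", [[2, 1, 3, 1, 2, 3]]),
]
y = [2, 5, 6, 9, 10, 13]  # synthetic property: 2 * tier 1 + tier 2

used, r2_by_tier = [], []
for name, columns in tiers:
    used.extend(columns)          # each model keeps all earlier tiers
    r2_by_tier.append(r_squared(used, y))
```

In this synthetic case the first tier already explains about 98% of the variance, the second tier accounts for the rest, and the third adds nothing, so the search would stop before the most expensive descriptors are computed.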

Task  5  Measurement  of  hydrophobicity 

We measured the octanol-water partition coefficient (P) for 15 analogs. Because the application
of the retention-time method (see the Annual Report for Year 2) gave values of logP about an
order-of-magnitude greater than those predicted by CLOGP, we considered it worthwhile to do
the measurements thoroughly. The results are shown in Table 1. For eight of the compounds
we used both a stirring method and a shake-flask method; the results from both methods agree
well. We filled a gap around logP = 2 with the compound 2-norbornane methanol (a mixture of
exo and endo). Figure 1 gives a plot of measured logP vs estimated logP (CLOGP) for the
analogs tested.
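
For reference, the quantity being measured is P = C(octanol) / C(water) at equilibrium, reported as log10 P. The short Python sketch below computes logP from phase concentrations, compares two direct methods, and forms the residual against a calculated estimate. All numbers, including the value standing in for a CLOGP prediction, are hypothetical and are not taken from Table 1.

```python
import math

def log_p(conc_octanol, conc_water):
    """log10 of the partition coefficient P = C_octanol / C_water."""
    if conc_octanol <= 0 or conc_water <= 0:
        raise ValueError("concentrations must be positive")
    return math.log10(conc_octanol / conc_water)

# Hypothetical equilibrium concentrations (same units in both phases).
measured = {
    "stirring": log_p(95.0, 1.0),
    "shake_flask": log_p(105.0, 1.0),
}
method_gap = abs(measured["stirring"] - measured["shake_flask"])
mean_measured = sum(measured.values()) / len(measured)

clogp_estimate = 1.8  # hypothetical calculated value for the same compound
residual = mean_measured - clogp_estimate  # measured minus estimated logP
```

Plotting such residuals for every analog is equivalent to reading the vertical offsets from the diagonal in a measured-versus-estimated plot.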

Task 6 Microbial degradation studies

Biodegradation and toxicity of quadricyclane and six of its analogs selected by molecular similarity
methods have been determined (see Appendix 1 for details). Results indicate that both
quadricyclane  and  its  selected  analogs  are  readily  degradable. 

Task  7  Photochemical  Degradation  Studies 

We carried out photochemical reactions with hydrogen peroxide on 6 additional compounds,
viz. endo-norborneol, exo-norborneol, 3,5-dihydroxytricyclo[2.2.1.0]heptane,
2,7-norbornanediol, dicyclopropylcarbinol and cis-exo-2,3-norbornanediol. (The experimental
details are given in the Annual Report for Year 2.) We observed no significant reactions in
these cases; the data are not shown.


Personnel  Supported 

University of Minnesota, Duluth: Subhash Basak, Keith Lodge, Greg Grunwald, Gloria Bly, and A. Hayford

University of South Carolina: Joseph Schubauer-Berigan and Darcy Wood


Publications 

The following publications, which are currently either published, accepted for publication or in
submission, report results of QSAR/QMSA analyses which were supported by the AFOSR
grant:

1. Use of graph-theoretic parameters in predicting inhibition of microsomal p-hydroxylation
of aniline by alcohols: a molecular similarity approach. Subhash C. Basak and Brian D.
Gute, pp. 492-504, In: Proceedings of the 2nd International Congress on Hazardous
Waste: Impacts on Human and Ecological Health, B.L. Johnson, C. Xintaras, J.S.
Andrews, Jr., Eds., Princeton Scientific Publishing Co. Inc., Princeton, New Jersey,
1997.

2. Estimation of the normal boiling points of haloalkanes using molecular similarity.
Subhash C. Basak, Brian D. Gute, and Gregory D. Grunwald. Croatica Chemica Acta,
69:1159-1173, 1996.

3.  Use  of  graph-theoretic  and  geometrical  molecular  descriptors  in  structure-activity 
relationships.  Subhash  C.  Basak,  Gregory  D.  Grunwald,  and  Gerald  J.  Niemi.  pp.  73- 
116,  In:  From  Chemical  Topology  to  Three-Dimensional  Geometry,  ed.  A.T.  Balaban, 
Plenum  Press,  New  York,  1997. 

4. Characterization of the molecular similarity of chemicals using topological invariants,
S. C. Basak, B. D. Gute, and G. D. Grunwald, In: Advances in Molecular Similarity, JAI
Press, submitted, 1997.

5.  A  comparative  study  of  topological  and  geometrical  parameters  in  estimating  normal 
boiling  point  and  octanol/water  partition  coefficient.  Subhash  C.  Basak,  Brian  D.  Gute, 
and  Gregory  D.  Grunwald.  J.  Chem.  Inf.  Comput.  Sci.,  36:1054-1060,  1996. 

6. Use of topostructural, topochemical and geometric parameters in the prediction of vapor
pressure: a hierarchical QSAR approach, S. C. Basak, B. D. Gute and G. D. Grunwald.
J. Chem. Inf. Comput. Sci., 37:651-655, 1997.

7.  Predicting  acute  toxicity  (LC50)  of  benzene  derivatives  using  theoretical  molecular 
descriptors:  a  hierarchical  QSAR  approach,  B.  D.  Gute  and  S.  C.  Basak.  SAR  QSAR 
Environ.  Res.,  in  press,  1997. 

8.  Characterization  of  molecular  structures  using  topological  indices,  S.C.  Basak  and  B.D. 
Gute.  SAR  QSAR  Environ.  Res.,  in  press,  1997. 

9.  The  relative  effectiveness  of  topological,  geometrical,  and  quantum  chemical 
parameters  in  estimating  mutagenicity  of  chemicals,  S.  C.  Basak,  B.  D.  Gute  and  G.  D. 
Grunwald.  In:  Proceedings  of  the  Seventh  International  Workshop  on  Quantitative 
Structure-Activity  Relationships  in  Environmental  Sciences,  SETAC  Press,  in  press, 
1997. 

10.  On  the  relationship  between  the  organic-carbon  normalized  sediment,  or  soil  sorption 
coefficient  and  the  octanol-water  partition  coefficient.  K.  B.  Lodge.  Res.  Notes, 
submitted,  1997. 


Interactions/transitions 

A.  Participation/Presentations 

1. Subhash C. Basak and Brian D. Gute presented an invited lecture at the international
symposium organized for the 1995 Herman Skolnik Award in Chemical Information. The
symposium was part of the American Chemical Society meeting, Orlando, Florida,
August 25-29, 1996.

2. Subhash C. Basak and Brian D. Gute gave an invited presentation "Quantitative
Molecular Similarity Analysis (QMSA) and Toxicity Prediction" at the US Air Force
Conference "Chemistry and Toxicology of Candidate Deicers" organized by the
Materials Directorate of Wright Patterson Air Force Base (WPAFB), Dayton, OH. While
there, Dr. Basak also attended the Air Force Office of Scientific Research (AFOSR)
Dermal Focus Group Meeting organized at WPAFB, August 6-7, 1996.

3. Subhash C. Basak presented a seminar "QSAR/QMSA using nonempirical parameters:
applications in predictive toxicology and drug discovery" at Abbott Laboratories,
Chicago, September 22-23, 1996.

4.  Brian  D.  Gute,  Subhash  C.  Basak  and  Greg  D.  Grunwald  gave  a  presentation  entitled 
"Development  of  QSARs  of  bioactive  molecules  using  a  hierarchical  approach"  at  the 
American Chemical Society 31st Midwest Regional meeting, November 6-8, 1996.

5.  Subhash  C.  Basak  gave  a  presentation  entitled  "Development  of  QMSA  and  QSAR 
methods  for  hazard  assessment  of  chemicals:  tools  for  computational  toxicology"  at  the 
Air  Force  Office  of  Scientific  Research  (AFOSR)  Toxicology  Program  Review, 
December  12-13, 1996,  Fairborn,  Ohio. 

6. Subhash C. Basak presented a seminar "Computational chemical graph theory and its
practical applications" in the Scientific Computing Seminar, Laboratory for Intelligent
Systems, ECE Dept. and CSc Dept., University of Minnesota, Duluth, on January 29,
1997.

7.  Subhash  C.  Basak,  Brian  D.  Gute  and  Greg  D.  Grunwald  presented  an  invited  paper 
entitled "Use of nonempirical structural descriptors in QSAR" in the session
"Mathematical approaches to QSAR and predictive toxicology" of the 11th International
Conference  on  Mathematical  and  Computer  Modelling  and  Scientific  Computing  in 
Washington,  DC,  March  27-April  3,  1997. 

8.  Subhash  C.  Basak,  Brian  D.  Gute,  and  Greg  D.  Grunwald  presented  an  invited  paper 
entitled "Use of theoretical molecular descriptors in structure-property and
structure-activity studies" at the 7th International Conference on Mathematical Chemistry and 3rd
Girona Seminar on Molecular Similarity, Girona, Spain, May 26-31, 1997.

9.  Subhash  C.  Basak  presented  an  invited  lecture  entitled  "Prediction  of  physicochemical 
and  toxicological  properties  of  chemicals  using  theoretical  molecular  descriptors"  at 
Moscow  State  University,  Moscow,  Russia,  June  30, 1997. 

10.  Subhash  C.  Basak,  Brian  D.  Gute,  and  Greg  D.  Grunwald  gave  an  invited  lecture 
entitled "Predicting bioactivity of chemicals from structure: a hierarchical QSAR
approach"  to  the  Department  of  Biochemistry,  University  of  Calcutta,  Calcutta,  India, 
July  30,  1997. 

B.  Consultative  and  Advisory  Functions 

Subhash C. Basak was invited to become a member of the National Advisory Board of the
Association of Ayurvedic Doctors of India (AADI).

C.  Transitions 

1. Computational methods were applied in the design of new anti-epileptic compounds in
cooperation  with  Professor  Alexandru  T.  Balaban,  Vice  President,  Rumanian  Academy 
of  Sciences. 

2.  Applied  similarity  and  QSAR  methods  in  the  design  of  novel  and  benign  deicing  agents 
working  in  cooperation  with  Professor  George  Mushrush,  Department  of  Chemistry, 
George  Mason  University,  Washington  D.C. 


New  Discoveries 

1. Hierarchical QSAR research using topostructural, topochemical, and geometrical
parameters  showed  that  the  first  two  classes  of  parameters  explain  most  of  the  variance 
in  the  data  of  toxicological  and  physicochemical  properties. 

2.  It  was  observed  that  similarity  spaces  derived  from  topostructural  and  topochemical 
parameters  have  distinct  analog  selection  characteristics. 


Honors/Awards


1. Subhash C. Basak was invited to be one of six speakers at the international
symposium organized for the 1995 Herman Skolnik Award in Chemical Information.
The symposium was held during the American Chemical Society Meeting, Orlando,
Florida, August 25-29, 1996, to honor Milan Randic, the recipient of the 1995 Herman
Skolnik Award.

2. Subhash C. Basak was invited to chair and organize two sessions at the 11th
International Conference on Mathematical and Computer Modelling and Scientific Computing,
March 31-April 3, 1997, Georgetown University, Washington, DC.

3.  Subhash  C.  Basak  was  invited  to  edit  a  special  volume  of  the  journal  Mathematical 
Modelling  and  Scientific  Computing  dealing  with  the  mathematical  aspects  of  QSAR  and 
predictive  toxicology. 

4.  Subhash  C.  Basak  was  invited  to  become  a  member  of  the  Organizing  and  Scientific 
Committee  for  the  International  Conference  on  Mathematical  and  Computer  Modelling 
and  Scientific  Computing. 

5. Subhash C. Basak was invited to present a lecture on molecular similarity at the 7th
International Conference on Mathematical Chemistry and 3rd Girona Seminar on
Molecular Similarity, Girona, Spain, May 26-31, 1997.

6. Subhash C. Basak has been invited to deliver a plenary lecture at the 17th Annual
Convention  of  the  Indian  Association  for  Cancer  Research  and  National  Symposium  on 
Breast  Cancer  to  be  held  in  Calcutta,  India,  January  21-24, 1998. 


APPENDIX  1. 

Annual  progress  report  of  the  University  of  South  Carolina  subcontract  for  the  AFOSR  grant  F 
49620-94-1401 

APPENDIX  2. 
Table 1. Direct Measurements of Log P

Task 6 Microbial degradation studies

Progress this year: In the past year we have primarily focused on quantifying
the biodegradation and toxicity of QDC and 6 of its analogs (identified by the work of
Dr. Basak). In correspondence to the AFOSR in March and May 1997 we requested a
3 month no cost project extension and permission to reprogram some of our
remaining grant funds to buy a liquid autosampler and computer for the gas
chromatograph to expedite sample analysis. Both requests were granted in June.

The equipment was purchased in June, and was in place and functioning by the end of
July 1997. This equipment has allowed us to complete a substantial portion of the
biodegradation work which is a major objective of the grant.

Biodegradation experiments: We have finished examining the aerobic
biodegradation of all seven compounds in fresh water wetland sediments and water,
in live and dead incubations. The degradation of each chemical was examined in
separate time course experiments each run for 33 days. As we reported earlier, we
went to great lengths to characterize the sediments used in these experiments for a
variety of parameters that could influence the rates of chemical degradation.
Preliminary analysis of the experiments suggests very similar degradation rates
among the chemicals, but analysis of the data from the experiments is incomplete.

In September 1997 we will be completing the analysis of the experiments. We also
hope to complete a series of experiments examining the anaerobic degradation of
QDC in fresh water and sediments and the degradation of QDC in saline water and
sediments.
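
One simple way to turn such time course data into comparable degradation rates is to fit a first-order decay, ln C(t) = ln C0 - k t, to each incubation and compare the rate constants of live and dead (killed control) treatments. The Python sketch below does this by linear regression on log concentrations. First-order kinetics is an assumption made for illustration, not a result stated in this report, and the concentrations are generated synthetically.

```python
import math

def first_order_rate(times, concentrations):
    """Rate constant k from a least-squares line through ln(C) versus t."""
    logs = [math.log(c) for c in concentrations]
    n = len(times)
    mean_t = sum(times) / n
    mean_l = sum(logs) / n
    sxy = sum((t - mean_t) * (l - mean_l) for t, l in zip(times, logs))
    sxx = sum((t - mean_t) ** 2 for t in times)
    return -sxy / sxx  # slope is -k for first-order decay

# Synthetic 33-day time courses (days, arbitrary concentration units).
days = [0, 5, 10, 20, 33]
live = [100.0 * math.exp(-0.10 * t) for t in days]   # live incubation
dead = [100.0 * math.exp(-0.08 * t) for t in days]   # killed control

k_live = first_order_rate(days, live)
k_dead = first_order_rate(days, dead)
half_life_live = math.log(2) / k_live   # days
k_biotic = k_live - k_dead              # portion attributable to live microbes
```

Under this reading, the killed-control rate constant approximates abiotic loss, and the difference between live and dead incubations approximates the biological contribution.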

Toxicity determinations: We have already assessed the toxicity of QDC and its
analogs using BODs and natural bacterial CFUs. This past year we developed a
novel approach to assess the toxicity and environmental risk of these chemicals
to natural bacterial community function utilizing BIOLOG plates. Briefly, the
method involves directly incubating natural water samples titrated with the
chemical of interest in BIOLOG plates. The plates contain 95 different carbon
substrates. We monitor the resulting community-dependent substrate utilization
patterns by the irreversible reduction of a tetrazolium dye associated with each
substrate. In this way we can quantify the effect or toxicity of a chemical on the
metabolism of classes of substrates by natural microbial communities. By
examining the intensity of the dye reduction over a period of days we can also assess
the effect the chemical has on the average rate of substrate metabolism by the
microbial community. This approach is exciting because it gives us a way to rapidly,
directly and specifically assess the risk associated with these chemicals to natural
ecosystem function. In comparison, one of the other approaches we are using to
assess the toxicity of the chemicals is Microtox. Although this approach works well
in comparing the relative toxicity of QDC and its analogs and potentially other
chemicals, one of the major weaknesses of Microtox is its inability to directly relate
the results to environmental risk in the field.

We have now completed tests of all of the chemicals using the modified
BIOLOG approach we developed, but analysis of the data is not complete. The
toxicity assessment of the chemicals using the Microtox method is underway but not
completed. Preliminary results of modified BIOLOG tests suggest widespread
disruption of natural microbial community metabolism at concentrations greater
than approximately 250-300 mg/L for QDC and its analogs. Preliminary Microtox
assays also indicate toxicity at similar concentrations. In comparison, previous
experiments showed no clear effect of the chemicals on BODs or bacterial species or
numbers (CFUs and direct counts).
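
To make the plate-based assessment concrete, the following Python sketch summarizes a plate by its average well color development (AWCD), a commonly used summary of tetrazolium dye reduction across substrate wells, and expresses the effect of a chemical as percent inhibition relative to an untreated plate. The use of AWCD is an assumption for illustration; the report does not state which summary statistic was applied. The absorbance readings are hypothetical, and only three substrate wells are shown instead of 95.

```python
def awcd(well_absorbances, blank_absorbance):
    """Average well color development: mean blank-corrected absorbance.

    Negative corrected values are set to zero before averaging.
    """
    corrected = [max(a - blank_absorbance, 0.0) for a in well_absorbances]
    return sum(corrected) / len(corrected)

def percent_inhibition(treated_awcd, control_awcd):
    """Reduction in community substrate use relative to the untreated plate."""
    return 100.0 * (1.0 - treated_awcd / control_awcd)

# Hypothetical absorbance readings for three substrate wells per plate.
control_plate = awcd([0.5, 0.7, 0.9], blank_absorbance=0.1)
treated_plate = awcd([0.3, 0.4, 0.5], blank_absorbance=0.1)
inhibition = percent_inhibition(treated_plate, control_plate)
```

Computing the same summary at each reading time over several days gives a color development curve whose slope reflects the average rate of substrate metabolism.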

One continuing experimental problem we are having
is directly related to the relatively low water solubility of the class of chemicals we are
examining. This has prevented us from keeping the chemicals in solution at the
highest test levels (>300-400 mg/L). We have partially gotten around this by using
ethanol to carry the chemicals. However, at the highest concentrations we are also
limited by the toxicity of the carrier solvents themselves, which we have empirically
determined.

Future plans: In the time remaining on the grant, we plan to: 1) finish the
Microtox toxicity tests; 2) finish the biodegradation studies mentioned above; and
3) finish analyzing the data from the toxicity and biodegradation studies. We expect
at least three publications to come from the microbial experiments: 1) a synthesis
paper with the other PIs of the study; 2) a paper describing the degradation patterns
of QDC and the analogs authored jointly with Dr. Lodge; and 3) a paper
comparing toxicity of QDC and its analogs, including a description of our modified
BIOLOG approach to assess chemical risk.

4) Accomplishments/New Findings: The research completed thus far indicates that
Quadricyclane and its analogs (selected using QSAR-SAR methods) all appear to
degrade rapidly, primarily abiotically, in water and sediment. Toxicity of QDC and
its analogs to natural microbial community function was noted using a newly
developed assessment method.

5) Personnel Supported: Dr. Joseph P. Schubauer-Berigan, Project Principal
Investigator, 22% effort; Ms. Darcy Wood, project technician, 100% effort.

6)  Publications:  None,  during  this  period. 

7) Interactions/Transactions: None, during this period.

8)  New  discoveries,  inventions  or  patent  disclosures:  None,  during  this  period. 

9)  Honors  and  Awards: 

R.A.  Sheldon  Scholarship,  University  of  Georgia,  1986 
Research  Internship,  University  of  Georgia  Marine  Institute,  1986 
Regents Award for Outstanding Teaching and Research, U. GA, 1985
University-wide  Fellowships,  University  of  Georgia,  1984,86,87 
Research  Fellowship,  Savannah  River  Ecology  Laboratory,  SC,  1977-78 
NSF/AEC Research Internship, Savannah River Ecology Laboratory, SC, 1976
Use of Graph-Theoretic Parameters in Predicting Inhibition
of Microsomal p-Hydroxylation of Aniline by Alcohols:

A Molecular Similarity Approach*

Subhash C. Basak, Brian D. Gute, Natural Resources Research Institute, University of
Minnesota, Duluth


Introduction 

Environmental and human health risk assessment of chemicals is often carried out using insufficient
experimental data. This is true for the large number of industrial chemicals, as well as for
substances identified in industrial effluent, hazardous waste sites and environmental monitoring
surveys (Auer et al. 1990). In 1984, the National Research Council studied the availability of
toxicity data on industrial chemicals, and found that many of these chemicals have little or no test
data (NRC 1984). About 13 million distinct chemicals have been registered with Chemical Abstracts
Service (CAS), and the list is growing by nearly 500,000 per year. Out of these chemicals, about
1,000 enter into societal use every year (Arcos 1987). Few of these chemicals are submitted with
the empirical data necessary for risk assessment. In the United States, the Toxic Substances Control
Act (TSCA) Inventory has over 74,000 entries, and the list is growing by nearly 3,000 per year
(Auer et al. 1990, TSCA 1976). Of the approximately 3,000 chemicals submitted yearly to the
United States Environmental Protection Agency (EPA) for the premanufacture notification (PMN)
process, more than 50% have no experimental data at all, less than 15% have empirical
mutagenicity data, and only about 6% have experimental ecotoxicological and environmental fate
data. This dearth of empirical data is also true for many of the over 700 chemicals found on the
Superfund list of hazardous substances (Auer et al. 1990).

A large number of physicochemical and biological test data on chemicals are a prerequisite to
the proper estimation of the hazards posed by a chemical species. Table 1 gives a partial list of
such properties. As a result of this lack of relevant data, a variety of structural, physicochemical,
and biochemical properties are used in hazard estimation. For example, in assessing the
carcinogenic potential of chemicals, three classes of criteria have been used by experts:

□ Structural

□ Functional

□ Guilt by association


* This is contribution number 154 from the Center for Water and the Environment of the Natural
Resources Research Institute. Research reported in this paper was supported, in part, by grant
F49620-94-1-0401 from the United States Air Force, Cooperative Agreement CR-819621 from the United
States Environmental Protection Agency, Exxon Biomedical Sciences, Inc. and the Structure-Activity
Relationship Consortium (SARCON) of the Natural Resources Research Institute at the University of
Minnesota. The authors would also like to extend their thanks to Greg Grunwald for helpful discussions.
Structural criteria consist of structural analogy of a chemical species with known and well-established
chemical carcinogens. Structural factors considered could be molecular size, shape,
branching pattern, symmetry, and charge, to list just a few. Frequently, structural characteristics
of molecules are not enough for reliable estimation of carcinogenic risk. Functional criteria, that
is, results of short-term tests, may often supplement structural criteria to more reliably estimate the
hazard. Ames test, mammalian cell transformation, and unscheduled DNA synthesis or pattern of
regulation of gene expression are examples of frequently used functional criteria relevant to
carcinogenic risk assessment. The guilt-by-association criteria consider that, even though a chemical
may have been found to be inactive through normal testing (bioassays), it may need to be retested more
stringently if it belongs to a class of compounds which contains potent carcinogens (Arcos 1987).

In the assessment of environmental hazards, often the physicochemical and biological test data
essential to hazard estimation are unavailable. In such cases, regulators use a two-tiered approach
to predict hazard from chemical structure: class-specific quantitative structure-activity relationship
(QSAR) models and chemical analogs (Auer et al. 1990).

Table 1. List of properties necessary for risk assessment of chemicals.

Physicochemical                     Biological

Molar Volume                        Receptor Binding (K_D)
Boiling Point                       Michaelis Constant (K_m)
Melting Point                       Inhibitor Constant (K_i)
Vapor Pressure                      Biodegradation
Aqueous Solubility                  Bioconcentration
Dissociation Constant (pKa)         Alkylation Profile
Partition Coefficient               Metabolic Profile
  : Octanol-water (log P)           Chronic Toxicity
  : Air-water                       Carcinogenicity
  : Sediment-water                  Mutagenicity
Reactivity (Electrophile)           Acute Toxicity
                                      : LD50
                                      : LC50

QSARs are mathematical models that use various quantifiers of chemical structure and empirical
parameters (or properties) in predicting physicochemical and biological properties of molecules
(Basak et al. 1990, Hansch 1976). In class-specific QSARs, a chemical is first assigned a specific
structural class and the QSAR of that particular class of chemicals is used to predict the potential
toxicity of the molecule of interest.

If a chemical is very complex, that is, contains many functional groups, a simplistic attempt at
classification is almost certain to fail. The use of class-specific QSARs in hazard assessment of
such chemicals will be limited. In such cases, one resorts to the approach of selecting analogs of
the chemical of interest and estimating the hazardous potential of the chemical from the toxicity of
its analogs. Analogs of new chemicals are routinely used by regulatory agencies like EPA in hazard
assessment (Auer et al. 1990). Chemical X is considered to be an analog of (or similar to) chemical
Y if X and Y resemble each other in one or more critical aspects, that is, structurally,
stereo-electronically, or physicochemically. The use of analogs is based on the tacit assumption that
similar structures have similar properties (Johnson et al. 1988, Johnson and Maggiora 1990).
A perusal of approaches used in carcinogenic risk assessment and in environmental and
ecotoxicological hazard estimation indicates that the candidate chemical is compared with known
toxicants, using structural or functional criteria. Experts often select these analogs (structurally
related chemicals) based on their individual judgements and a selected set of structural features.

Chemical analogs can be selected using empirical descriptors or theoretical molecular descriptors
(Basak and Grunwald, In Press). The paucity of available experimental data for environmental
pollutants makes it desirable to develop methods for selecting chemical analogs using nonempirical
variables, which are computed directly from molecular structure (Basak and Grunwald, In Press).

In recent years, we have developed several methods for quantifying intermolecular similarity.
Such methods are based on topological indices (TIs) and substructural variables, like atom pairs.
TIs are numerical graph invariants that encode information like size, shape, branching pattern,
symmetry and certain aspects of stereo-electronic factors associated with molecules (see TI symbols
and definitions in Table 3). Topological parameters can be useful in predicting physicochemical as
well as biological properties of many different congeneric sets of molecules (Basak 1988; Basak
and Grunwald 1993; Basak et al. 1982, 1983, 1984, 1986, 1987a, 1987b, 1990, 1991; Kier and
Hall 1986; Niemi et al. 1992; Randić 1975). Molecular similarity methods based on substructures
and TIs have been used successfully in selecting analogs, in discovering novel drugs active against
human immunodeficiency virus (HIV), and in estimating different physicochemical and toxicological
properties (Basak et al. 1988, 1994, In Press b; Basak and Grunwald 1994, 1995a, 1995b,
1995c, 1995d, In Press a; Lajiness 1990; Wilkins and Randić 1980).

In this paper, similarity methods based on topological indices and atom pairs have been used
to estimate the inhibitory effects (pIC50) of a set of 19 aliphatic alcohols on microsomal
p-hydroxylation of anilines by cytochrome P450.

Database 

Experimental pIC50 values for inhibition of microsomal cytochrome P450 p-hydroxylation of anilines
by nineteen alkanols are in Table 2 (Cohen and Mannering 1973). The original set contained 20
compounds; one, methanol, was deleted. Because of its single, unique atom pair, similarity of
methanol to other compounds cannot be computed using the atom pair method.

Computation  of  parameters 

Topological  Indices 

The 64 TIs in this study were calculated using POLLY 2.3 (Table 4), which can calculate 98 TIs
from SMILES line notation input of chemical structures (Basak et al. 1988a). TIs include Wiener
index (Wiener 1947), connectivity indices (Kier and Hall 1986, Randić 1975), information theoretic
indices defined on distance matrices of graphs (Bonchev and Trinajstić 1977, Raychaudhury et al.
1984), parameters derived on the neighborhood complexity of vertices in hydrogen-filled molecular
graphs (Basak 1987, Basak and Magnuson 1983, Basak et al. 1980, Roy et al. 1984), path lengths,
and Balaban's J indices (1982, 1983, 1986).

Methods for calculating a few TIs used in this paper follow. The Wiener index W, the first
topological index reported in the chemical literature, may be calculated from the distance matrix
D(G) of a hydrogen-suppressed chemical graph G as the sum of the entries in the upper triangular
distance submatrix. The distance matrix D(G) of a nondirected graph G with n vertices is a
symmetric n x n matrix with elements d_ij equal to the distance between vertices v_i and v_j in G. Each
Table 2. 19 alkanols and their observed and predicted inhibition of microsomal
p-hydroxylation of anilines (pIC50), by atom pair (AP) and Euclidean distance (ED) methods

                                     Est. pIC50, AP method (K)          Est. pIC50, ED method (K)
Alkanol                   Obs. pIC50   1      2      3      4      5      1      2      3      4      5

Ethanol                     -MO      -0.48  -0.48  -034   -034   -0.15  -0.48  -0.48  -033   4)34   -032
1-Propanol                  -0.48    -0.05  -0.05  -0.15  -030   -038   -0.05  -030   -0.04  -0.13  -0.12
1-Butanol                   -0.05    037    037    0.02   -0.11  0.02   -0.48  -0.42  -030   -0.16  -0.16
1-Pentanol                  0.27     0.54   0.54   034    005    031    -0.07  -0.06  -030   -0.24  -0.08
1-Hexanol                   0.54     0.68   0.68   0.54   0.48   0.26   0.25   0.26   0.02   -0.01  -0.01
1-Heptanol                  0.68     0.54   034    0.45   0.41   0.23   0.54   0.40   0.14   0.01   0.06
2-Methyl-1-propanol         -0.39    -0.15  -0.17  -0.16  -0.17  -0.45  -0.37  -0.28  -0.30  -0.24  -0.29
2-Methyl-1-butanol          -0.15    -0.05  -032   -0.16  -0.22  -0.27  -0.07  -0.13  -0.10  -0.17  -031
3-Methyl-1-butanol          -0.19    -0.07  -0.07  -0.18  -033   -0.10  -0.37  -0.26  -0.30  -0.24  -0.21
2,2-Dimethyl-1-propanol     -0.67    -0.39  -039   -0.42  -0.35  -0.52  -0.47  -0.43  -0.40  -0.24  -0.29
2-Propanol                  -0.47    -0.39  -037   -0.54  -0.46  -038   -039   -0.75  -0.66  -0.58  -0.47
2-Butanol                   -0.35    -0.15  -0.15  -0.12  -0.11  -0.18  -0.07  -0.06  0.05   -0.08  -0.10
2-Pentanol                  -0.07    0.15   0.15   -0.06  -0.16  -0.18  -035   -0.04  -0.04  -0.07  -0.15
2-Hexanol                   0.15     035    035    0.14   0.09   -0.01  -0.47  -0.10  -0.09  -0.01  0.10
2-Heptanol                  0.25     0.15   0.15   038    035    031    034    0.04   0.11   0.12   0.08
3-Pentanol                  -0.37    -0.47  -0.68  -0.61  -0.68  -039   -0.19  -0.29  -034   -030   -0.23
3-Hexanol                   -0.47    -0.07  -032   -0.17  -032   -0.04  0.15   031    0.12   032    0.11
2-Methyl-3-pentanol         -0.89    -138   -138   -1.04  -0.88  -0.51  -0.07  -0.11  -0.14  -032   -035
2,4-Dimethyl-3-pentanol     -138     -0.89  -0.89  -0.72  -0.64  -036   -0.89  -0.89  -038   -0.45  -0.40


diagonal element d_ii of D(G) is zero. We give below the distance matrix D(G1) of the labeled
hydrogen-suppressed graph G1 of 1-butanol (Figure 1):


Figure 1. Hydrogen-suppressed graph of 1-butanol


D(G1) = [5 x 5 distance matrix, rows and columns labeled by vertices 1-5]
W is calculated as:

    W = \frac{1}{2} \sum_{i,j} d_{ij} = \sum_{h} h \cdot g_h        (1)

where g_h is the number of unordered pairs of vertices whose distance is h.
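
As an illustration of Equation 1, the following minimal Python sketch builds the distance matrix of the hydrogen-suppressed graph of 1-butanol by breadth-first search and sums its upper triangle. The vertex numbering (0-4 along the C-C-C-C-O chain) and the function names are assumptions made for this example; they are not taken from POLLY or from Figure 1.

```python
from collections import deque


def distance_matrix(adjacency):
    """All-pairs shortest-path distances (counted in bonds) by breadth-first search."""
    n = len(adjacency)
    dist = [[0] * n for _ in range(n)]
    for start in range(n):
        seen = {start}
        queue = deque([(start, 0)])
        while queue:
            vertex, d = queue.popleft()
            dist[start][vertex] = d
            for nbr in adjacency[vertex]:
                if nbr not in seen:
                    seen.add(nbr)
                    queue.append((nbr, d + 1))
    return dist


def wiener_index(dist):
    """Sum of the entries above the diagonal of the distance matrix (Equation 1)."""
    n = len(dist)
    return sum(dist[i][j] for i in range(n) for j in range(i + 1, n))


# Hydrogen-suppressed 1-butanol, C-C-C-C-O; vertices numbered 0..4 along the
# chain (an assumed labeling for illustration).
BUTANOL = [[1], [0, 2], [1, 3], [2, 4], [3]]
D = distance_matrix(BUTANOL)
W = wiener_index(D)
```

For this chain there are 4, 3, 2 and 1 vertex pairs at distances 1, 2, 3 and 4, so both forms of Equation 1 give W = 4 + 6 + 6 + 4 = 20.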

Information-theoretic topological indices are calculated by the application of information theory
to chemical graphs. An appropriate set A of n elements is derived from a molecular graph G,
depending on certain structural characteristics. On the basis of an equivalence relation defined on
A, the set A is partitioned into disjoint subsets A_i of order n_i (i = 1, 2, ..., h; \sum_i n_i = n).
A probability distribution is then assigned to the set of equivalence classes:

    A_1, A_2, ..., A_h
    p_1, p_2, ..., p_h

where p_i = n_i/n is the probability that a randomly selected element of A will occur in the ith subset.
The mean information content of an element of A is defined by Shannon's (1948) relation:

    IC = -\sum_{i=1}^{h} p_i \log_2 p_i        (2)

The logarithm is taken at base 2 to measure information content in bits. The total information
content of the set A is then n times IC.
To account for the chemical nature of vertices and their bonding pattern, Sarkar et al. (1978)
calculated information content (IC) of chemical graphs on an equivalence relation, where two atoms
of the same element are considered equivalent if they possess an identical first-order topological
neighborhood. Since properties of atoms, or reaction centers, are often modulated by
physicochemical characteristics of distant neighbors, that is, neighbors of neighbors, it was deemed
essential to extend this approach to account for higher-order neighbors of vertices. This can be
accomplished by defining open spheres for all vertices of a chemical graph. One can construct such
open spheres for higher integral values of r. For a particular value of r, the collection of all such
open spheres S(v,r), where v runs over the whole vertex set V, forms a neighborhood system of
the vertices of G. A suitably defined equivalence relation can then partition V into disjoint subsets
consisting of topological neighborhoods of vertices of up to rth order neighbors.

This approach has been used to generate the indices of neighborhood symmetry. In this method,
chemical species are symbolized by weighted linear graphs. Two vertices u_0 and v_0 of a molecular
graph are said to be equivalent with respect to the rth order neighborhood if, and only if,
corresponding to each path u_0, u_1, ..., u_r of length r, there is a distinct path v_0, v_1, ..., v_r of the
same length, such that the paths have similar edge weights, and both u_0 and v_0 are connected to the
same number and type of atoms up to the rth order bonded neighbors. The detailed equivalence
relation is described in our earlier studies (Roy et al. 1984).

Once partitioning of the vertex set for a particular order of neighborhood is completed, IC_r is
calculated from Equation 2. Subsequent information theoretic invariants include structural
information content (SIC_r) shown in Equation 3 (Basak et al. 1980) and complementary information
content (CIC_r) shown in Equation 4 (Basak and Magnuson 1983). In both equations, n is the total
number of vertices of the graph:

    SIC_r = IC_r / \log_2 n        (3)

    CIC_r = \log_2 n - IC_r        (4)
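
Equations 2-4 can be sketched in a few lines of Python once the sizes of the equivalence classes are known. The function below is a hypothetical helper written for this illustration; it takes the class sizes as given and does not implement the neighborhood equivalence relation itself. The example partition (atoms of hydrogen-filled 1-butanol grouped by element only) is likewise an assumption chosen for simplicity.

```python
import math


def information_indices(class_sizes):
    """Return (IC, SIC, CIC) for a partition given by its class sizes n_i.

    IC follows Equation 2 with p_i = n_i / n; SIC and CIC follow
    Equations 3 and 4. At least two elements are needed so that log2(n) > 0.
    """
    n = sum(class_sizes)
    if n < 2:
        raise ValueError("need at least two elements")
    ic = -sum((k / n) * math.log2(k / n) for k in class_sizes)
    sic = ic / math.log2(n)
    cic = math.log2(n) - ic
    return ic, sic, cic


# Illustrative partition: hydrogen-filled 1-butanol (C4H10O) with atoms
# grouped by element only, i.e. 4 carbons, 10 hydrogens, 1 oxygen.
IC0, SIC0, CIC0 = information_indices([4, 10, 1])
```

Two limiting cases help to read the indices: when every vertex is in its own class, IC equals log2 n, so SIC = 1 and CIC = 0; when all vertices fall in one class, IC = 0 and CIC = log2 n.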
Table 3. Topological index symbols and definitions.

I_D^W         Information index for the magnitudes of distances between all possible pairs of
              vertices of a graph
\bar{I}_D^W   Mean information index for the magnitude of distance
W             Wiener index = half-sum of the off-diagonal elements of the distance matrix of a graph
I^D           Degree complexity
H^V           Graph vertex complexity
H^D           Graph distance complexity
\bar{IC}      Information content of the distance matrix partitioned by frequency of occurrences
              of distance h
O             Order of neighborhood when IC_r reaches its maximum value for the hydrogen-filled graph
I_ORB         Information content or complexity of the hydrogen-suppressed graph at its maximum
              neighborhood of vertices
O_ORB         Maximum order of neighborhood of vertices for I_ORB within the hydrogen-suppressed graph
M_1           A Zagreb group parameter = sum of square of degree over all vertices
M_2           A Zagreb group parameter = sum of cross-product of degrees over all neighboring
              (connected) vertices
IC_r          Mean information content or complexity of a graph based on the rth (r = 0-4) order
              neighborhood of vertices in a hydrogen-filled graph
SIC_r         Structural information content for rth (r = 0-4) order neighborhood of vertices in a
              hydrogen-filled graph
CIC_r         Complementary information content for rth (r = 0-4) order neighborhood of vertices in
              a hydrogen-filled graph
^h\chi        Path connectivity index of order h = 0-6
^h\chi_C      Cluster connectivity index of order h = 3-5
^h\chi_PC     Path-cluster connectivity index of order h = 4-6
^h\chi^v      Valence path connectivity index of order h = 0-6
^h\chi^v_C    Valence cluster connectivity index of order h = 3-5
^h\chi^v_PC   Valence path-cluster connectivity index of order h = 4-6
P_h           Number of paths of length h = 0-7
J             Balaban's J index based on distance
J^X           Balaban's J index based on relative electronegativities
J^Y           Balaban's J index based on relative covalent radii
The information-theoretic index on graph distance, I_D^W, is calculated from the distance matrix
D(G) of a chemical graph G by the method of Bonchev and Trinajstić (1977):

    I_D^W = W \log_2 W - \sum_{h} g_h \cdot h \log_2 h        (5)

The mean information index, \bar{I}_D^W, is found by dividing the information index I_D^W by W.
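
A short Python sketch of Equation 5 follows. It assumes a precomputed distance matrix; here the 5-vertex chain of hydrogen-suppressed 1-butanol is used, with vertices numbered consecutively so that d_ij = |i - j| (an assumed labeling). The function name is hypothetical.

```python
import math
from collections import Counter


def distance_information(dist):
    """Return (W, I_D^W, mean I_D^W) from a symmetric distance matrix.

    g[h] counts the unordered vertex pairs at distance h; the graph must
    have at least two vertices so that W > 0.
    """
    n = len(dist)
    g = Counter(dist[i][j] for i in range(n) for j in range(i + 1, n))
    w = sum(h * count for h, count in g.items())
    i_dw = w * math.log2(w) - sum(
        count * h * math.log2(h) for h, count in g.items()
    )
    return w, i_dw, i_dw / w


# Chain of five vertices numbered along the path, so d_ij = |i - j|.
PATH5 = [[abs(i - j) for j in range(5)] for i in range(5)]
W, I_DW, MEAN_I_DW = distance_information(PATH5)
```

For this chain W = 20 and the distance counts are g_1 = 4, g_2 = 3, g_3 = 2, g_4 = 1, which gives I_D^W of about 62.93 bits and a mean value of about 3.15 bits.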

Indices developed by A.T. Balaban (1982, 1983, 1986) were calculated and used in this
analysis. Balaban denoted these as J indices, which are based upon the distance sums s_i of a
chemical graph. J is defined as:

    J = \frac{q}{\mu + 1} \sum_{\text{edges } ij} (s_i s_j)^{-1/2}        (6)

where the cyclomatic number \mu (or number of rings in the graph) is \mu = q - n + 1, with q
adjacencies or edges and n vertices. In the original definition of J, the term s_i referred to either the
row distance sum for vertex i in the distance matrix (D) or the multigraph distance matrix (M):

    s_i = \sum_{j} d_{ij}        (7)

For distance matrix D, each matrix element d_ij represents the distance from vertex i to vertex j. The
diagonal entries are all zero, and the distance between any two adjacent vertices would be one. All
other entries in this matrix would be the number of edges, or bonds, traversed in the shortest path
from i to j. To account for the periodicity of chemical properties for heteroatoms, Balaban proposed
two J variants: J^X, which includes corrections for heteroatom electronegativities, and J^Y, which has
corrections for heteroatom covalent radii (Balaban 1986).
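
The distance-based J of Equations 6 and 7 can be sketched as below. This is only the plain index computed from D; the heteroatom corrections used for J^X and J^Y are not implemented, so the oxygen of 1-butanol is treated like any other vertex. The function name and the vertex numbering are assumptions for this example.

```python
import math


def balaban_j(dist, edges):
    """Balaban's distance-based J index (Equation 6).

    dist  -- symmetric distance matrix of the hydrogen-suppressed graph
    edges -- list of (i, j) vertex pairs, one per bond
    """
    n = len(dist)
    q = len(edges)
    mu = q - n + 1                      # cyclomatic number
    s = [sum(row) for row in dist]      # distance sums, Equation 7
    return q / (mu + 1) * sum(1.0 / math.sqrt(s[i] * s[j]) for i, j in edges)


# Five-vertex chain (skeleton of 1-butanol), vertices numbered along the path.
PATH5 = [[abs(i - j) for j in range(5)] for i in range(5)]
EDGES = [(0, 1), (1, 2), (2, 3), (3, 4)]
J = balaban_j(PATH5, EDGES)
```

Here the distance sums are 10, 7, 6, 7, 10, the chain is acyclic (\mu = 0), and J evaluates to about 2.19.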

Atom Pairs

Atom pairs were calculated using the method of Carhart et al. (1985). An atom pair is defined as
a substructure consisting of two non-hydrogen atoms i and j and their interatomic separation:

<atom descriptor_i>-<separation>-<atom descriptor_j>

where <atom descriptor_i> contains information about the element type, number of non-hydrogen
neighbors and the number of \pi electrons. Interatomic separation of two atoms is the number of
atoms traversed in the shortest bond-by-bond path containing both atoms.

Figure 2 demonstrates the calculation of atom pairs for 1-butanol. 1-Butanol has 10 total atom
pairs, 9 of which are unique. In Figure 2, X_n (n = 1 or 2 in this example) represents the number
of non-hydrogen neighbors and the C and O are atomic symbols. These are the elements of the atom
descriptors. The "-k-", k = 2, 3, 4, and 5, are the separation values.

The first atom pair (CX_1 - 2 - CX_2) corresponds to path ab, a methyl carbon with 1
non-hydrogen neighbor bonded to a methylene carbon with 2 non-hydrogen neighbors. The path length
of ab is 2. Paths bc and cd are identical; each consists of a path of length 2 which joins two methylene
carbons, each of which has two non-hydrogen neighbors. Hence, this atom pair has a frequency of
2. Path de, involving a methylene carbon and the oxygen of the hydroxyl group, defines atom pair 3.


Figure 2. Determination of atom pairs for 1-butanol

1-Butanol

CH3-CH2-CH2-CH2-OH
 a   b   c   d   e

Atom Pair                 Freq. of Occurrence    Path

1. CX_1 - 2 - CX_2        1                      ab
2. CX_2 - 2 - CX_2        2                      bc, cd
3. CX_2 - 2 - OX_1        1                      de
4. CX_1 - 3 - CX_2        1                      abc
5. CX_2 - 3 - CX_2        1                      bcd
6. CX_2 - 3 - OX_1        1                      cde
7. CX_1 - 4 - CX_2        1                      abcd
8. CX_2 - 4 - OX_1        1                      bcde
9. CX_1 - 5 - OX_1        1                      abcde
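
The enumeration in Figure 2 can be reproduced with a small Python sketch. It is a simplified reading of the atom pair definition, not the Carhart et al. implementation: the descriptor here carries only the element symbol and the number of non-hydrogen neighbors (the \pi electron count is omitted, which makes no difference for saturated alkanols), and the two descriptors of a pair are sorted so that each pair has one canonical form. All names are hypothetical.

```python
from collections import Counter, deque


def atom_pairs(elements, adjacency):
    """Count atom pairs of a hydrogen-suppressed molecular graph.

    elements  -- element symbol of each non-hydrogen atom
    adjacency -- list of neighbor index lists
    Returns a Counter keyed by (descriptor, separation, descriptor), where
    separation is the number of atoms on the shortest path (bonds + 1).
    """
    n = len(elements)
    descriptors = [f"{elements[v]}X{len(adjacency[v])}" for v in range(n)]
    pairs = Counter()
    for start in range(n):
        bonds = {start: 0}              # breadth-first bond distances
        queue = deque([start])
        while queue:
            v = queue.popleft()
            for nbr in adjacency[v]:
                if nbr not in bonds:
                    bonds[nbr] = bonds[v] + 1
                    queue.append(nbr)
        for end, d in bonds.items():
            if end > start:             # count each unordered pair once
                first, second = sorted((descriptors[start], descriptors[end]))
                pairs[(first, d + 1, second)] += 1
    return pairs


# 1-Butanol, atoms a-e of Figure 2 as indices 0-4.
BUTANOL_PAIRS = atom_pairs(
    ["C", "C", "C", "C", "O"], [[1], [0, 2], [1, 3], [2, 4], [3]]
)
# Methanol has only two non-hydrogen atoms, hence a single atom pair.
METHANOL_PAIRS = atom_pairs(["C", "O"], [[1], [0]])
```

For 1-butanol the sketch yields 9 distinct atom pairs with a total frequency of 10, and the pair CX2-2-CX2 occurs twice, in agreement with Figure 2.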

Statistical Analysis and Similarity Measures

Data Reduction

Initially, all TIs were transformed by the natural logarithm of the value of the index plus one. This
was done because the scale of some TIs may be several orders of magnitude greater than others.

Principal Components Analysis (PCA)

The data analyzed in this study may be viewed as n (number of chemicals) vectors in p
(number of TIs) dimensions. The data for each set can be represented by a matrix X, which has n
rows and p columns. For each of the molecular structures, the number of calculated parameters
was 64 (TIs of Table 3). Each chemical is therefore represented by a point in R^64. If each molecule
were represented in R^2, then one could plot and investigate the extent of relationship between individual
parameters. In R^64 such a simple analysis is not possible. However, since many of the TIs are highly
intercorrelated, the points in R^64 can likely be represented by a subspace of fewer dimensions. The
method of principal components analysis (PCA), or the Karhunen-Loeve transformation, is a
standard method for reduction of dimensionality (Gnanadesikan 1977). The first principal
component (PC) is the line which comes closest to the points, in the sense of minimizing the sum
of the squared Euclidean distances from the points to the line. The second PC is given by
projections onto the basis vector orthogonal to the first PC. For points in R^p, the first r principal
components give the subspace which comes closest to approximating the n points. The first PC is
the first axis of the points. Successive axes are major directions orthogonal to previous axes. The
PCs are the closest approximating hyperplane, and because they are calculated from eigenvectors
of a p x p matrix, the computations are relatively accessible. But there are important scaling choices,
because PCs are scale dependent. To control this dependence, the most commonly used convention
is to rescale the variables so that each variable has mean zero and standard deviation one. The
PCA on the TIs has been carried out using SAS software (SAS 1989).
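
The data reduction and scaling steps described above can be sketched in plain Python as follows. This is an illustrative reimplementation, not the SAS procedure used in the study: it applies the ln(x + 1) transform, standardizes each column, forms the correlation matrix, and extracts only the first principal component by power iteration. The index values in RAW are made up for the example, and the code assumes no column is constant.

```python
import math


def log_transform(rows):
    """Replace each index value x by ln(x + 1)."""
    return [[math.log(x + 1.0) for x in row] for row in rows]


def standardize(rows):
    """Rescale each column to mean zero and standard deviation one."""
    n, p = len(rows), len(rows[0])
    out = [[0.0] * p for _ in range(n)]
    for j in range(p):
        col = [row[j] for row in rows]
        mean = sum(col) / n
        sd = math.sqrt(sum((x - mean) ** 2 for x in col) / (n - 1))
        for i in range(n):
            out[i][j] = (col[i] - mean) / sd
    return out


def correlation_matrix(z):
    """Correlation matrix of already standardized data."""
    n, p = len(z), len(z[0])
    return [
        [sum(z[i][a] * z[i][b] for i in range(n)) / (n - 1) for b in range(p)]
        for a in range(p)
    ]


def first_principal_component(corr, iterations=200):
    """Leading eigenvalue and unit eigenvector of corr by power iteration."""
    p = len(corr)
    vec = [1.0 + 0.1 * j for j in range(p)]   # arbitrary non-uniform start
    for _ in range(iterations):
        nxt = [sum(corr[a][b] * vec[b] for b in range(p)) for a in range(p)]
        norm = math.sqrt(sum(x * x for x in nxt))
        vec = [x / norm for x in nxt]
    eigenvalue = sum(
        vec[a] * sum(corr[a][b] * vec[b] for b in range(p)) for a in range(p)
    )
    return eigenvalue, vec


# Made-up index values: 4 molecules (rows) by 3 indices (columns).
RAW = [[1.0, 3.0, 5.0], [3.0, 15.0, 2.0], [7.0, 63.0, 9.0], [15.0, 255.0, 4.0]]
Z = standardize(log_transform(RAW))
CORR = correlation_matrix(Z)
EIGENVALUE, LOADINGS = first_principal_component(CORR)
PC1_SCORES = [sum(z * w for z, w in zip(row, LOADINGS)) for row in Z]
```

Because the variables are standardized, the eigenvalues of the correlation matrix sum to p, so EIGENVALUE / p is the proportion of variance explained by the first PC. In the toy data the first two columns become perfectly correlated after the log transform, which is why a single component carries at least two thirds of the variance.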

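Similarity Measures

Two measures of intermolecular similarity were used [...]. These include an associative
measure based on atom pairs and Euclidean distance (ED) within a five-dimensional PC space.

K-Neighbor Selection

Using the topologically-based methods described above, the intermolecular similarity of the
chemicals was quantified. The K nearest neighbors (K = 1-5) of each chemical were then
determined using the two similarity methods. The mean observed pIC50 of the K nearest
neighbors was used as the estimated pIC50 for the compound. The correlation (r) of observed
pIC50 with estimated pIC50 and the standard error (SE) of the estimates were used to assess the
relative efficacy of the two similarity methods.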
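
The K-nearest-neighbor estimation step can be sketched as follows for the Euclidean distance case. The coordinates and property values are made up for the example, ties in distance are broken by list order (an assumption, since the source does not say how ties were handled), and the function names are hypothetical.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two points given as coordinate lists."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def knn_estimates(points, observed, k):
    """Estimate each compound's property as the mean observed value of its
    k nearest neighbors, excluding the compound itself."""
    n = len(points)
    estimates = []
    for i in range(n):
        others = sorted(
            (j for j in range(n) if j != i),
            key=lambda j: euclidean(points[i], points[j]),
        )
        estimates.append(sum(observed[j] for j in others[:k]) / k)
    return estimates


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


# Made-up example: four compounds placed in a one-dimensional PC space.
POINTS = [[0.0], [1.0], [2.5], [6.0]]
OBSERVED = [1.0, 2.0, 4.0, 8.0]
EST_K1 = knn_estimates(POINTS, OBSERVED, 1)
EST_K2 = knn_estimates(POINTS, OBSERVED, 2)
R_K1 = pearson_r(OBSERVED, EST_K1)
```

With K = 1 each estimate is simply the observed value of the single closest compound, which is why the K = 1 columns of Table 2 repeat observed pIC50 values of other alkanols.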

Results

The PCA of 64 TIs for the 19 alkanols [...] cumulatively, 94.5% of the total variation [...].
Table 4 lists, for each of the five PCs, the proportion of variance explained, the cumulative
variance explained, and the two TIs most correlated with each PC. The first PC is strongly
correlated with parameters which characterize the size of the molecular graph [...]. The second
PC is correlated with the higher order complexity indices [...]. For the third PC, the highest
correlations occur with lower order [...] and cluster and valence path-cluster connectivity
indices [...]. [...] expectations based on previous research (Basak and Grunwald 1994, 1995a,
[...]) [...] complexity indices, PC3 with [...]. In this particular case, we see PC2 and PC3
reversed.

Table 4. Summary of the principal component analysis (PCA) using 64 TIs for 19 alkanols.

PC    Eigenvalue    % Variance    Cumulative %    1st Correlated TI (r)    2nd Correlated TI (r)

1     [...]         54.2          54.2            P_0 (0.997)              \chi (0.997)
2     [...]         25.5          79.7            CIC (0.932)              SIC (-0.931)
3     [...]          7.7          87.4            IC (-0.603)              [...] (0.595)
4     [...]          5.5          92.9            \chi_PC (-0.565)         \chi_PC (-0.543)
5     [...]          1.6          94.5            P_7 (-0.444)             \chi_PC (0.265)

TI: Topological indices
Table 2 presents estimates of pIC50 for each similarity method at each K-level (K = 1-5). The
atom pair method gave the best overall results. The AP standard errors fell within the range of the
PC standard errors, and the correlations were all 10% to 25% higher. The best correlation for the
atom pair method was 0.878 for K = 1. It should be noted that results for K of 1, 2, and 3 were
all very close, within 0.013 units for correlation and standard errors of within 0.01 -log units.

Table 5 reports the correlation and standard errors of pIC50 estimates with observed values
for both the atom pair (AP) and the principal component (PC) methods. Each line of the table
represents a different K level. The standard error for estimation was at its minimum of 0.17 -log
units for the PC method with K = 5. The correlation, however, was at its maximum of 0.878 using
the AP method with K = 1.

Table 5. -Log IC50 estimation for alkanols by K-nearest neighbors using atom pair (AP) and
Euclidean distance (ED) similarity approaches

         AP Method           ED Method
K        r        SE         r        SE

1        0.878    0.26       0.661    0.36
2        0.865    0.27       0.707    [...]
3        0.871    0.26       0.595    0.23
4        0.855    0.29       0.566    0.19
5        0.811    0.34       0.638    0.17

r: Correlation
SE: Standard error

Discussion 

The objective of this study was to investigate the utility of nonempirically based molecular similarity
methods in estimating the inhibitory potency (pIC50) of a group of aliphatic alcohols for microsomal
p-hydroxylation of aniline. The result shows that the atom pair method of quantifying similarity
gives a reasonable estimate of pIC50 values of alkanols (Table 2).

It is evident from an analysis of results in Table 5 that the AP method is superior to the ED
method in predicting pIC50 values. This is true for K = 1-5. This indicates that atom pairs quantify
structural aspects of alkanols, relevant to inhibition of aniline p-hydroxylation by microsomal
cytochrome P450, better than the Euclidean space derived from the calculated numerical graph
invariants. Further work is in progress to determine the relative effectiveness of AP vis-a-vis ED
methods in estimating physicochemical as well as toxicological properties of chemicals.
A molecular similarity measure has been used to estimate the normal
boiling points of a set of 267 haloalkanes with 1-4 carbon atoms.
Molecular similarity/dissimilarity was quantified in terms of
Euclidean distances of molecules in the eight dimensional principal
component space derived from fifty-nine topological indices.
Correlation coefficients between the experimental and estimated boiling
points ranged from 0.854 to 0.943 in the K-nearest neighbor estimation
of boiling points using a different number of nearest neighbors
(K = 1-10, 15, 20, 25).


INTRODUCTION 

The use of structural analogy as a tool to classify chemicals, as well as
predict the behaviour of chemical species, is as old as chemistry. In 1819,
Mitscherlich described the phenomenon of isomorphism, in which substitution
of one atom by another leads to similar lattice structures. At the turn
of this century, Langmuir observed that isosteric chemical species, those
which contain the same total number of atoms and electrons, have very
similar properties. Members of isosteric pairs, like N2-CO and N2O-CO2,
have many similar physical constants. The structural similarity of the isosteric
amino acids valine and threonine poses some interesting problems in
the protein synthesis mechanism of cells. Being sterically similar, valine and
threonine may be charged to the same tRNA. The incorrectly formed aminoacyl
adenylate and aminoacyl tRNA are discriminated and destroyed via
a "double sieve", involving steric exclusion and ineffective binding, before
they are used in protein synthesis.

Similarity plays an important role in biological activity. The enzyme
dihydrofolate reductase normally facilitates the reduction of dihydrofolate to
tetrahydrofolate. Methotrexate, a compound whose structure is similar to
dihydrofolate, inhibits the action of the reductase. Competitive inhibition of
enzymes can also result from interaction of the enzyme with transition state
analogs of the substrate. For example, proline racemase from Clostridium
sticklandii preferentially binds the transition state of proline. As a result,
the racemase is subject to inhibition by compounds which are structural
analogs of the transition state of proline, such as pyrrole-2-carboxylate and
pyrroline-2-carboxylate, which bind to the enzyme with a much greater
affinity than does proline. Furthermore, the structural similarity between a
macromolecular biotarget and its antiidiotypic antibody is believed to be the
reason for the use of such antibodies as model receptors in the screening of
chemicals for drug discovery.

The last decade has seen an upsurge of interest in the development of
similarity measures and their applications in chemical research, drug
design, and toxicology. Such methods are based on different representations
of chemical species, viz., topological, geometrical, quantum chemical, etc. In
drug design, similarity searching of databases is used to identify potential
leads. Also, dissimilarity-based methods are used to select chemicals for
screening in the drug discovery process. In toxicology, structural and
functional analogy are used to assess the ecological and human health risk of
new and existing chemicals.

In the United States, the majority of chemicals submitted to the
Environmental Protection Agency (USEPA) for registration do not have any test
data. One of the methods used by regulators for the hazard assessment of
such chemicals is to select their analogs and, subsequently, estimate the
hazard of the chemical of interest from the hazard of the analogs. Such
selection of analogs is often done subjectively by individual experts on the
basis of an intuitive notion of similarity. In USEPA's approach to ecological
risk assessment, class-specific QSARs are preferred over the use of analogs,
although in human health hazard assessment, analog-based estimation of
toxic potential is still the most important factor.

Rapid selection of analogs for drug design and hazard assessment
requires automated methods that are computationally feasible. Similarity
methods based on parameters that can be calculated directly from molecular
structure fall into this category. Topological indices derived from a
molecular graph comprise a set of parameters which can be computed for any
chemical structure.

In some of our recent studies, we have developed novel methods of
quantifying molecular similarity using topological indices and substructural
features like atom pairs. We have applied similarity techniques in the
selection of analogs and in the estimation of molecular properties such as
[...]. In this paper, [...] normal boiling point for a set of 267
chlorofluorocarbons (CFCs) [...] similarity method based on topological
indices [...].


DATABASES 

[...] 267 CFCs with 1-4 carbon atoms [...] collected from Beilstein's
Handbuch der Organischen Chemie, [...] Handbook of Chemistry and Physics,
Heilbron's Dictionary of Organic Compounds, and Smith and Srivastava's
Thermodynamic Data for Pure Compounds, Part B, for use in several studies by
Balaban et al. For [...] purposes, the subset of 276 CFCs was further
reduced to 267 [...] outliers. Nine compounds whose normal boiling points
were more [...] standard deviations from the mean boiling point of the
[...] which would give reasonable estimates of boiling point. Table I
[...] of the compounds used in this study and their normal boiling points.


METHODS 


Calculation of Topological Indices

[...] POLLY [...] input of chemical structures. The TIs calculated [...]
include the Wiener index [...] calculated by Randić [...] information
theoretic indices defined on distance matrices of graphs [...] by the
methods of Bonchev and Trinajstić as well as those of [...] derived on
the neighbourhood complexity [...] graphs [...] path lengths [...].


Data Reduction

[...] The natural [...] several orders of magnitude greater than other TIs
may be [...] added before the log transformation since many of the TIs
[...]. Principal component analysis (PCA) was used to reduce the [...] set
of 59 topological indices (TIs). With PCA, linear combinations of the TIs,
called principal components

TABLE I

Normal boiling points of 267 haloalkanes with 1-4 carbon atoms

No. | Chemical Name | Normal Boiling Point | Est. Boiling Point | Residual Boiling Point
1 | carbon tetrachloride | 76.7 | 33.3 | 43.4
2 | trichloromethane | 61.2 | 28.6 | 32.6
3 | dichloromethane | 39.8 | 7.1 | 32.7
4 | trichlorofluoromethane | 23.7 | 1.3 | 22.4
5 | dichlorofluoromethane | 8.9 | 4.3 | 4.6
6 | chlorofluoromethane | -9.1 | 3.5 | -12.6
7 | chloromethane | -24.3 | -9.3 | -15.0
8 | dichlorodifluoromethane | -29.8 | 6.9 | -36.7
9 | chlorodifluoromethane | -40.8 | -8.4 | -32.4
10 | difluoromethane | -51.7 | 12.0 | -63.7
11 | hexachloroethane | 184.4 | 146.6 | 37.8
12 | 1,1,1,2,2-pentachloro-2-fluoroethane | 137.9 | 136.2 | 1.7
13 | 1,1,1,2-tetrachloro-2-fluoroethane | 117.0 | 106.9 | 10.1
14 | 1,1,2,2-tetrachloro-1-fluoroethane | 116.6 | 97.7 | 18.9
15 | 1,1,2-trichloroethane | 113.7 | 78.7 | 35.0
16 | 1,1,2-trichloro-2-fluoroethane | 102.4 | 68.8 | 33.6
17 | 1,1,2,2-tetrachloro-1,2-difluoroethane | 92.7 | 55.0 | 37.7
18 | 1,1,1-trichloroethane | 74.0 | 25.2 | 48.8
19 | 1,2-dichloro-1-fluoroethane | 73.8 | 71.4 | 2.4
20 | 1,1,2-trichloro-1,2-difluoroethane | 72.5 | 60.4 | 12.1
21 | 1,2-dichloro-1,2-difluoroethane | 58.5 | 33.9 | 24.6
22 | 1,1-dichloroethane | 57.2 | 4.8 | 52.4
23 | 1,1,1-trichloro-2,2,2-trifluoroethane | 45.8 | 57.5 | -11.7
24 | 1,2-dichloro-1,1-difluoroethane | 46.6 | 43.8 | 2.8
25 | 2-chloro-1,1-difluoroethane | 35.1 | 57.6 | -22.5
26 | 1,1-dichloro-1-fluoroethane | 32.0 | 15.2 | 16.8
27 | 2,2-dichloro-1,1,1-trifluoroethane | 28.7 | 42.3 | -13.6
28 | 1-chloro-1-fluoroethane | 16.1 | 32.6 | -16.5
29 | chloroethane | 12.3 | 7.6 | 4.7
30 | 1-chloro-1,1,2-trifluoroethane | 12.0 | 20.0 | -8.0
31 | 2-chloro-1,1,1-trifluoroethane | 6.9 | 21.1 | -14.2
32 | 1,1,2-trifluoroethane | 5.0 | 27.7 | -22.7
33 | 1,2-dichloro-1,1,2,2-tetrafluoroethane | 3.6 | 40.6 | -37.0
34 | 2,2-dichloro-1,1,1,2-tetrafluoroethane | 3.6 | 29.1 | -25.5
35 | 1-chloro-1,1,2,2-tetrafluoroethane | -12.0 | 24.1 | -36.1
36 | 1,1,2,2-tetrafluoroethane | -22.8 | 63.6 | -86.4
37 | 1,1-difluoroethane | -25.8 | 27.8 | -53.6
38 | 1,1,1,2-tetrafluoroethane | -26.1 | -1.1 | -25.0
39 | fluoroethane | -37.8 | 17.6 | -55.4
40 | 1,1,1-trifluoroethane | -47.3 | -8.5 | -38.8
41 | 1,1,1,2,2-pentafluoroethane | -48.3 | -14.4 | -33.9
42 | 1,1,2,2,3,3-hexachloropropane | 218.5 | 201.4 | 17.1
43 | 1,1,1,2,2,3-hexachloropropane | 218.0 | 199.9 | 18.1
44 | 1,1,1,2,3,3-hexachloro-2,3-difluoropropane | 196.0 | 183.8 | 12.2
45 | 1,1,1,2,2,3-hexachloro-3,3-difluoropropane | 193.4 | 197.4 | -4.0

TABLE I (continued)

No. | Chemical Name | Normal Boiling Point | Est. Boiling Point | Residual Boiling Point
46 | 1,1,1,2,2-pentachloro-3,3-difluoropropane | 175.0 | 178.0 | -3.0
47 | 1,1,1,3,3-pentachloro-2,2-difluoropropane | 174.0 | 147.8 | 26.2
48 | 1,1,2,2-tetrachloropropane | 165.5 | 173.1 | -7.6
49 | 1,1,3,3-tetrachloropropane | 161.9 | 170.6 | -8.7
50 | 1,2,3-trichloropropane | 156.8 | 153.4 | 3.4
51 | 1,1,2,3,3-pentachloro-1,2,3-trifluoropropane | 154.7 | 128.6 | 26.1
52 | 1,1,2,2-tetrachloropropane | 153.0 | 151.6 | 1.4
53 | 1,1,2,2,3-pentachloro-1,3,3-trifluoropropane | 152.3 | 156.8 | -4.5
54 | 1,1,1,3-tetrachloro-2,2-difluoropropane | 151.2 | 130.7 | 20.5
55 | 1,1,1,2-tetrachloropropane | 150.4 | 152.1 | -1.7
56 | 1,1,2,2-tetrachloro-3,3-difluoropropane | 147.6 | 132.1 | 15.5
57 | 1,1,3-trichloropropane | 145.5 | 140.9 | 4.6
58 | 1,1,2,2-tetrachloro-1,3,3-trifluoropropane | 134.6 | 126.9 | 7.7
59 | 1,2,3-trichloro-2-fluoropropane | 130.8 | 111.9 | 18.9
60 | 1,1,2,3-tetrachloro-2,3,3-trifluoropropane | 129.8 | 106.8 | 23.0
61 | 1,1,3-trichloro-2,2-difluoropropane | 127.3 | 99.2 | 28.1
62 | 1,1,2,2-tetrachloro-3,3,3-trifluoropropane | 126.2 | 132.3 | -6.1
63 | 1,1,2-trichloro-2-fluoropropane | 116.7 | 99.2 | 17.5
64 | 1,1,3,3-tetrachloro-1,2,2,3-tetrafluoropropane | 114.0 | 105.0 | 9.0
65 | 1,1,1,3-tetrachloro-2,2,3,3-tetrafluoropropane | 113.9 | 104.3 | 9.6
66 | 1,1,2-trichloro-1-fluoropropane | 113.5 | 99.9 | 13.6
67 | 1,1,1,2-tetrachloro-2,3,3,3-tetrafluoropropane | 112.5 | 121.2 | -8.7
68 | 1,1,2,2-tetrachloro-1,3,3,3-tetrafluoropropane | 112.3 | 125.2 | -12.9
69 | 1,2,2,3-tetrachloro-1,1,3,3-tetrafluoropropane | 112.2 | 113.4 | -1.2
70 | 1,1,3-trichloro-1,2,2-trifluoropropane | 109.5 | 87.2 | 22.3
71 | 1,1,1-trichloropropane | 108.0 | 105.9 | 2.1
72 | 1,2,2-trichloro-3,3,3-trifluoropropane | 104.5 | 120.8 | -16.3
73 | 1,3-dichloro-2,2-difluoropropane | 96.7 | 93.6 | 3.1
74 | 1,3,3-trichloro-1,1,2,2-tetrafluoropropane | 91.8 | 90.2 | 1.6
75 | 1,2,2-trichloro-1,1-difluoropropane | 90.2 | 86.5 | 3.7
76 | 1,2,3-trichloro-1,1,2,3-tetrafluoropropane | 90.0 | 84.5 | 5.5
77 | 2,3-dichloro-1,1,2,3-tetrafluoropropane | 89.8 | 72.2 | 17.6
78 | 1,2,3-trichloro-1,1,3,3-tetrafluoropropane | 88.0 | 76.7 | 11.3
79 | 1-chloro-3-fluoropropane | 81.0 | 101.7 | -20.7
80 | 1,2,3-trichloro-1,1,2,3,3-pentafluoropropane | 73.7 | 65.4 | 8.3
81 | 2,3,3-trichloro-1,1,1,2,3-pentafluoropropane | 73.4 | 65.4 | 8.0
82 | 1,3,3-trichloro-1,1,2,2,3-pentafluoropropane | 73.0 | 65.5 | 7.5
83 | 2,2,3-trichloro-1,1,1,3,3-pentafluoropropane | 72.0 | 80.6 | -8.6
84 | 1,2-dichloro-1,1-difluoropropane | 70.0 | 82.1 | -12.1
85 | 2,2-dichloropropane | 69.3 | 34.3 | 35.0
86 | 1-chloro-2-fluoropropane | 68.5 | 70.7 | -2.2
87 | 1,1-dichloro-1-fluoropropane | 66.6 | 77.6 | -11.0
88 | 1-chloro-2,2-difluoropropane | 55.1 | 44.0 | 11.1
89 | 1-chloro-1,2-difluoropropane | 52.9 | 72.8 | -19.9
90 | 2,2-dichloro-1,1,1-trifluoropropane | 48.8 | 72.0 | -23.2

TABLE I (continued)

No. | Chemical Name | Normal Boiling Point | Est. Boiling Point | Residual Boiling Point
91 | 1-chloropropane | 46.6 | 43.8 | 2.8
92 | 3,3-dichloro-1,1,1,2,2-pentafluoropropane | 45.5 | 58.8 | -13.3
93 | 1,3-difluoropropane | 41.6 | 88.8 | -47.2
94 | 2-chloropropane | 35.7 | 31.3 | 4.4
95 | 1,3-dichloro-1,1,2,2,3,3-hexafluoropropane | 35.7 | 33.8 | 1.9
96 | 2-chloro-2-fluoropropane | 35.2 | 25.5 | 9.7
97 | 3,3-dichloro-1,1,1,2,2,3-hexafluoropropane | 35.0 | 50.5 | -15.5
98 | 3-chloro-1,1,1,2,2-pentafluoropropane | 27.6 | 42.1 | -14.5
99 | 1-chloro-1,1-difluoropropane | 25.4 | 50.0 | -24.6
100 | 1-chloro-1,1,2,2,3,3-hexafluoropropane | 21.0 | 32.2 | -11.2
101 | 1,1,1,2,3-pentafluoropropane | 20.0 | 20.6 | -0.6
102 | 1,1,2,2,3,3-hexafluoropropane | 10.5 | 11.2 | -0.7
103 | 1,1-difluoropropane | 7.5 | 61.1 | -53.6
104 | 1,1,1,2,3,3-hexafluoropropane | 5.0 | 9.0 | -4.0
105 | 1,1,1,2,2,3-hexafluoropropane | 1.2 | 5.3 | -4.1
106 | 2-chloro-1,1,1,2,3,3,3-heptafluoropropane | -2.0 | 12.6 | -14.6
107 | 2-fluoropropane | -9.7 | 18.1 | -27.8
108 | 1,1,1-trifluoropropane | -12.5 | 7.1 | -19.6
109 | 1,1,1,2,3,3,3-heptafluoropropane | -19.0 | 20.6 | -39.6
110 | 1-chloro-2-fluoroethane | 53.0 | 48.7 | 4.3
111 | 1-chloro-1,1-difluoroethane | -9.8 | -1.9 | -7.9
112 | 1-chloro-1,1,2,2,2-pentafluoroethane | -38.0 | 4.8 | -42.8
113 | 1,2-dichloroethane | 83.5 | 50.5 | 33.0
114 | 1,1-dichloro-2,2-difluoroethane | 60.0 | 57.1 | 2.9
115 | 1,1-dichloro-1,2,2-trifluoroethane | 30.2 | 41.3 | -11.1
116 | 1,2-dichloro-1,1,2-trifluoroethane | 28.2 | 41.7 | -13.5
117 | 1,1,2-trichloro-1-fluoroethane | 88.5 | 56.5 | 32.0
118 | 1,1,1-trichloro-2,2-difluoroethane | 73.0 | 68.1 | 4.9
119 | 1,1,2-trichloro-2,2-difluoroethane | 71.2 | 62.3 | 8.9
120 | 1,1,2-trichloro-1,2,2-trifluoroethane | 47.6 | 47.4 | 0.2
121 | 1,1,1,2-tetrachloroethane | 130.5 | 98.3 | 32.2
122 | 1,1,2,2-tetrachloroethane | 146.3 | 61.7 | 84.6
123 | 1,1,1,2-tetrachloro-2,2-difluoroethane | 91.6 | 102.0 | -10.4
124 | 1,1,1,2,2-pentachloroethane | 161.9 | 149.5 | 12.4
125 | 1,1,2,2,3-pentachloropropane | 196.0 | 193.0 | 3.0
126 | 1,1,2,3,3-pentachloropropane | 199.0 | 184.4 | 14.6
127 | 1,1,2,2,3-pentachloro-3,3-difluoropropane | 168.4 | 148.0 | 20.4
128 | 1,1,2,3,3-pentachloro-1,3-difluoropropane | 167.4 | 130.5 | 36.9
129 | 1,1,1,2,2-pentachloro-3,3,3-trifluoropropane | 153.0 | 172.7 | -19.7
130 | 1,1,1,2,3-pentachloro-2,3,3-trifluoropropane | 153.3 | 145.2 | 8.1
131 | 1,1,1,3,3-pentachloro-2,2,3-trifluoropropane | 153.0 | 137.3 | 15.7
132 | 1,1,1,2,3,3-hexachloropropane | 217.0 | 207.2 | 9.8
133 | 1,1,1,3,3,3-hexachloropropane | 206.0 | 151.7 | 54.3
134 | 1,1,1,2,2,3-hexachloro-3-fluoropropane | 210.0 | 201.5 | 8.5
135 | 1,1,1,2,3,3-hexachloro-3-fluoropropane | 207.0 | 178.7 | 28.3

TABLE I (continued)

No. | Chemical Name | Normal Boiling Point | Est. Boiling Point | Residual Boiling Point
136 | 1,1,2,2,3,3-hexachloro-1-fluoropropane | 210.0 | 185.3 | 24.7
137 | 1,1,1,3,3,3-hexachloro-2,2-difluoropropane | 194.2 | 148.5 | 45.7
138 | 1,1,2,2,3,3-hexachloro-1,3-difluoropropane | 194.2 | 181.4 | 12.8
139 | 1,2-dichloro-1,1,2,3,3-pentafluoropropane | 56.3 | 65.1 | -8.8
140 | 2,3-dichloro-1,1,1,2,3-pentafluoropropane | 56.0 | 65.6 | -9.6
141 | 1,1,2-trichloropropane | 133.0 | 94.7 | 38.3
142 | 1,2,2-trichloropropane | 122.0 | 103.1 | 18.9
143 | 1,1,1-trichloro-2,2-difluoropropane | 102.0 | 75.0 | 27.0
144 | 1,2,2-trichloro-1,1,3,3-tetrafluoropropane | 92.0 | 99.4 | -7.4
145 | 3,3,3-trichloro-1,1,1,2,2-pentafluoropropane | 70.5 | 89.4 | -18.9
146 | 1,1,2-trichloro-1,2-difluoropropane | 97.7 | 74.0 | 23.7
147 | 1,1,3-trichloro-3,3-difluoropropane | 107.8 | 70.3 | 37.5
148 | 3-chloro-1,1,1,3,3-pentafluoropropane | 28.4 | 63.7 | -35.3
149 | 2-chloro-1,1,1,3,3,3-hexafluoropropane | 15.5 | 20.3 | -4.8
150 | 3-chloro-1,1,1,2,2,3,3-heptafluoropropane | -2.5 | 15.0 | -17.5
151 | 3-chloro-1,1,1,2,2,3-hexafluoropropane | 20.0 | 22.2 | -2.2
152 | 1,1-dichloropropane | 88.1 | 91.8 | -3.7
153 | 1,2-dichloropropane | 96.0 | 90.3 | 5.7
154 | 1,3-dichloropropane | 120.8 | 105.4 | 15.4
155 | 1,2-dichloro-2-fluoropropane | 88.6 | 57.5 | 31.1
156 | 1,2-dichloro-1-fluoropropane | 93.0 | 64.8 | 28.2
157 | 1,1-dichloro-2,2-difluoropropane | 79.0 | 80.3 | -1.3
158 | 1,3-dichloro-1,1-difluoropropane | 80.8 | 79.6 | 1.2
159 | 1,1-dichloro-1,2,2-trifluoropropane | 60.2 | 62.4 | -2.2
160 | 3,3-dichloro-1,1,1-trifluoropropane | 72.4 | 81.3 | -8.9
161 | 1,2-dichloro-1,1,2-trifluoropropane | 55.6 | 63.1 | -7.5
162 | 2,3-dichloro-1,1,1-trifluoropropane | 76.7 | 99.6 | -22.9
163 | 1,3-dichloro-1,1,2,2-tetrafluoropropane | 68.2 | 71.8 | -3.6
164 | 2,3-dichloro-1,1,1,3,3-pentafluoropropane | 50.4 | 70.3 | -20.0
165 | 2,3-dichloro-1,1,1,2,3,3-hexafluoropropane | 34.7 | 40.0 | -5.3
166 | 1,2,3-trichloro-1,1-difluoropropane | 114.3 | 102.0 | 12.3
167 | 1,1,1-trichloro-3,3,3-trifluoropropane | 95.1 | 109.6 | -14.5
168 | 1,1,2-trichloro-3,3,3-trifluoropropane | 106.8 | 103.0 | 3.8
169 | 2,3,3-trichloro-1,1,1,3-tetrafluoropropane | 87.2 | 91.4 | -4.2
170 | 1,1,3-trichloro-1,2,2,3-tetrafluoropropane | 90.5 | 109.6 | -19.1
171 | 1,1,1,3-tetrachloropropane | 158.0 | 142.6 | 15.4
172 | 1,1,2,3-tetrachloropropane | 180.0 | 156.6 | 23.4
173 | 1,1,1,2-tetrachloro-2-fluoropropane | 139.6 | 113.5 | 26.1
174 | 1,1,2,2-tetrachloro-1-fluoropropane | 135.0 | 114.4 | 20.6
175 | 1,1,1,3-tetrachloro-3,3-difluoropropane | 132.0 | 112.1 | 19.9
176 | 1,1,1,2-tetrachloro-3,3,3-trifluoropropane | 125.1 | 141.4 | -16.3
177 | 1,1,2,3-tetrachloro-1,3,3-trifluoropropane | 128.7 | 111.9 | 16.8
178 | 1,1,3,3-tetrachloro-2,2,3-trifluoropropane | 127.0 | 105.6 | 21.4
179 | 1,1,2,3-tetrachloro-1,2,3,3-tetrafluoropropane | 112.5 | 97.0 | 15.5
180 | 1-fluoropropane | -2.3 | 24.7 | -27.0

TABLE I (continued)

No. | Chemical Name | Normal Boiling Point | Est. Boiling Point | Residual Boiling Point
181 | octafluoropropane | -38.0 | -0.5 | -37.5
182 | 2,2-difluoropropane | -0.5 | 11.3 | -11.8
183 | 1,1,1,3-tetrafluoropropane | 29.4 | 23.3 | 6.1
184 | 1,1,1,3,3,3-hexafluoropropane | 0.8 | 18.2 | -17.4
185 | 1,1,1,2,2,3,3-heptafluoropropane | -17.0 | 11.7 | -28.7
186 | 1-chloro-1-fluoropropane | 48.0 | 74.8 | -26.8
187 | 3-chloro-1,1,1-trifluoropropane | 45.1 | 65.1 | -19.0
188 | 2-chloro-1,1-difluoropropane | 52.0 | 74.7 | -22.7
189 | 2-chloro-1,1,1-trifluoropropane | 30.0 | 47.7 | -17.7
190 | 1-fluorobutane | 32.2 | 66.7 | -34.5
191 | 2-fluorobutane | 24.7 | 46.3 | -21.6
192 | 1,1,1,2,2,4,4,4-octafluorobutane | 18.0 | 59.5 | -41.5
193 | 1,1,2,2,3,3,4,4-octafluorobutane | 43.0 | 38.1 | 4.9
194 | 1,1,1,2,2,3,3,4,4-nonafluorobutane | 14.0 | 55.4 | -41.4
195 | decafluorobutane | -2.0 | 43.0 | -45.0
196 | 1-chlorobutane | 78.5 | 90.1 | -11.6
197 | 2-chlorobutane | 68.5 | 58.6 | 9.9
198 | 1-chloro-4-fluorobutane | 115.0 | 109.7 | 5.3
199 | 1-chloro-1,1-difluorobutane | 55.5 | 56.1 | -0.6
200 | 3-chloro-1,1,1-trifluorobutane | 66.0 | 72.3 | -6.3
201 | 1-chloro-1,1,3,3-tetrafluorobutane | 70.5 | 78.0 | -7.5
202 | 2-chloro-1,1,1,3,3,3-hexafluorobutane | 51.0 | 74.2 | -23.2
203 | 4-chloro-1,1,1,2,2,3,3-heptafluorobutane | 54.0 | 49.3 | 4.7
204 | 4-chloro-1,1,1,2,2,3,3,4,4-nonafluorobutane | 30.0 | 45.0 | -15.0
205 | 1,1-dichlorobutane | 115.0 | 145.0 | -30.0
206 | 1,2-dichlorobutane | 123.5 | 150.2 | -26.7
207 | 1,3-dichlorobutane | 133.0 | 140.8 | -7.8
208 | 1,4-dichlorobutane | 155.0 | 130.8 | 24.2
209 | 1,3-dichloro-1,1,3-trifluorobutane | 129.0 | 80.9 | 48.1
210 | 3,4-dichloro-1,1,1,2,2,3-hexafluorobutane | 72.0 | 78.5 | -6.5
211 | 1,4-dichloro-1,1,3-trifluorobutane | 118.5 | 97.5 | 21.0
212 | 2,3-dichloro-1,1,1,4,4,4-hexafluorobutane | 78.0 | 71.9 | 6.1
213 | 4,4-dichloro-1,1,1,2,2,3,3-heptafluorobutane | 76.5 | 72.0 | 4.5
214 | 4,4-dichloro-1,1,1,2,2,3,3,4-octafluorobutane | 62.8 | 71.1 | -8.3
215 | 3,4-dichloro-1,1,1,2,2,3,4,4-octafluorobutane | 66.0 | 70.6 | -4.6
216 | 1,4-dichloro-1,1,2,2,3,3,4,4-octafluorobutane | 64.0 | 71.3 | -7.3
217 | 2,2-dichloro-1,1,1,3,3,4,4,4-octafluorobutane | 64.0 | 70.5 | -6.5
218 | 2,3-dichloro-1,1,1,2,3,4,4,4-octafluorobutane | 64.0 | 76.3 | -12.3
219 | 1,1,1-trichlorobutane | 133.5 | 137.9 | -4.4
220 | 1,1,2-trichlorobutane | 156.8 | 143.5 | 13.3
221 | 1,1,3-trichlorobutane | 153.8 | 143.3 | 10.5
222 | 1,1,4-trichlorobutane | 183.8 | 133.3 | 50.5
223 | 2,2,3-trichloro-1,1,1,4,4,4-hexafluorobutane | 104.0 | 107.7 | -3.7
224 | 4,4,4-trichloro-1,1,1,2,2,3,3-heptafluorobutane | 96.5 | 85.4 | 11.1
225 | 1,3,4-trichloro-1,1,2,2,3,4,4-heptafluorobutane | 99.0 | 91.3 | 7.7

ESTIMATION OF THE NORMAL BOILING POINTS OF HALOALKANES


TABLE I (continued)

No. | Chemical Name | Normal Boiling Point | Est. Boiling Point | Residual Boiling Point
226 | 2,2,3-trichloro-1,1,1,3,4,4,4-heptafluorobutane | 97.4 | 77.7 | 19.7
227 | 1,1,4,4-tetrachlorobutane | 200.0 | 148.7 | 51.3
228 | 1,2,4,4-tetrachloro-1,1,2,3,3,4-hexafluorobutane | 134.0 | 92.6 | 41.4
229 | 1,2,3,4-tetrachloro-1,1,2,3,4,4-hexafluorobutane | 134.0 | 85.2 | 48.8
230 | 1,1,2,3,4,4-hexachloro-1,2,3,4-tetrafluorobutane | 208.0 | 113.2 | 94.8
231 | 1-chloroisobutane | 68.3 | 60.5 | 7.8
232 | 2-chloroisobutane | 50.7 | 38.5 | 12.2
233 | 1-chloro-1-fluoroisobutane | 82.5 | 109.8 | -27.3
234 | 1,1-dichloroisobutane | 105.0 | 107.4 | -2.4
235 | 1,2-dichloroisobutane | 106.5 | 99.1 | 7.4
236 | 1,3-dichloroisobutane | 136.0 | 134.6 | 1.4
237 | 1,1-dichloro-1-fluoroisobutane | 107.0 | 116.1 | -9.1
238 | 1,2,3-trichloroisobutane | 163.0 | 146.0 | 17.0
239 | 1,1,2,3-tetrachloroisobutane | 191.0 | 185.2 | 5.8
240 | 1,2,3-trichloro-2-chloromethylpropane | 211.0 | 183.3 | 27.7
241 | 1,1,2,3-tetrachloro-2-chloromethylpropane | 227.0 | 204.3 | 22.7
242 | 1-fluoroisobutane | 16.0 | 56.1 | -40.1
243 | 2-fluoroisobutane | 12.0 | 38.1 | -26.1
244 | 1,1,1,3,3,3-hexafluoroisobutane | 21.5 | 25.5 | -4.0
245 | 1,1,1,3,3,3-hexafluoro-2-fluoromethylpropane | 40.0 | 18.9 | 21.1
246 | 1,1,1,3,3,3-hexafluoro-2-difluoromethylpropane | 33.0 | 20.3 | 12.7
247 | 1,1,1,3,3,3-hexafluoro-2-trifluoromethylpropane | 12.0 | 6.0 | 6.0
248 | decafluoroisobutane | -0.3 | 3.6 | -3.9
249 | 3-chloro-1,1,1,3,3-pentafluoroisobutane | 59.0 | 68.6 | -9.6
250 | 1,1,1,3,3,3-hexafluoro-2-chloromethylpropane | 58.0 | 39.6 | 18.4
251 | 2,3-dichloro-1,1,1-trifluoroisobutane | 93.5 | 101.0 | -7.5
252 | 2,3-dichloro-1,1,1,3,3-pentafluoroisobutane | 75.3 | 91.2 | -15.9
253 | 2,3-dichloro-1,1,1,3,3-pentafluoro-2-trifluoromethylpropane | 65.0 | 65.8 | -0.8
254 | 1,1,2-trichloroisobutane | 163.0 | 143.1 | 19.9
255 | 1,2,3-trichloro-1,1-difluoroisobutane | 132.0 | 114.7 | 17.3
256 | 2,3,3-trichloro-1,1,1-trifluoroisobutane | 123.7 | 124.6 | -0.9
257 | 1,1,1,3,3,3-hexafluoro-2-trichloromethylpropane | 107.0 | 106.7 | 0.3
258 | 1,1,1,2-tetrachloro-3,3,3-trifluoroisobutane | 148.5 | 148.7 | -0.2
259 | 1,1,1,2,3-pentachloroisobutane | 211.0 | 201.3 | 9.7
260 | 1-chloro-1,1,2,2-tetrafluoropropane | 19.9 | 27.5 | -7.6
261 | 1,1,1-trichloropropane | 104.0 | 99.6 | 4.4
262 | 2,3-dichlorobutane | 116.0 | 105.2 | 10.8
263 | 2,2,3-trichlorobutane | 143.0 | 152.3 | -9.3
264 | 1,2,3-trichlorobutane | 166.0 | 141.7 | 24.3
265 | 1,4-difluorobutane | 77.8 | 122.0 | -44.2
266 | 2,2-difluorobutane | 30.9 | 40.0 | -9.1
267 | 1,2-difluoroethane | 26.0 | 40.7 | -14.7

TABLE II

Topological index symbols and definitions

Symbol | Definition
I_D^W | Information index for the magnitudes of distances between all possible pairs of vertices of a graph
mean I_D^W | Mean information index for the magnitude of distance
W | Wiener index = half-sum of the off-diagonal elements of the distance matrix of a graph
I^D | Degree complexity
H^V | Graph vertex complexity
H^D | Graph distance complexity
mean IC | Information content of the distance matrix partitioned by the frequency of occurrences of distance h
O | Order of neighbourhood when IC_r reaches its maximum value for the hydrogen-filled graph
I_ORB | Information content or complexity of the hydrogen-suppressed graph at its maximum neighbourhood of vertices
O_ORB | Maximum order of neighbourhood of vertices for I_ORB within the hydrogen-suppressed graph
M1 | A Zagreb group parameter = sum of the square of degree over all vertices
M2 | A Zagreb group parameter = sum of the cross-product of degrees over all neighbouring (connected) vertices
IC_r | Mean information content or complexity of a graph based on the r-th (r = 0-3) order neighbourhood of vertices in a hydrogen-filled graph
SIC_r | Structural information content for the r-th (r = 0-3) order neighbourhood of vertices in a hydrogen-filled graph
CIC_r | Complementary information content for the r-th (r = 0-3) order neighbourhood of vertices in a hydrogen-filled graph
^hχ | Path connectivity index of the order h = 0-5
^hχ_C | Cluster connectivity index of the order h = 3-6
^hχ_PC | Path-cluster connectivity index of the order h = 4-6
^hχ^v | Valence path connectivity index of the order h = 0-5
^hχ^v_C | Valence cluster connectivity index of the order h = 3-6
^hχ^v_PC | Valence path-cluster connectivity index of the order h = 4-6
P_h | Number of paths of length h = 0-5
J | Balaban's J index based on distance
J^X | Balaban's J index based on relative electronegativities
J^Y | Balaban's J index based on relative covalent radii

(PCs), are derived from the correlation matrix. The first PC has the
largest variance, or eigenvalue, of a linear combination of TIs. Each
subsequent PC explains the maximal [...] variance orthogonal to previous
PCs. With 59 TIs available, 59 [...]. PCs with an eigenvalue greater than
[...]. Basak et al. provide more detail on this approach.
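
The data reduction step can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions: the function name `reduce_indices`, the toy data, and the log(TI + 1) scaling are assumptions made for the example, and the retention rule (eigenvalue greater than 1.0) is taken from the results reported later, not from a confirmed description of the authors' software.

```python
import numpy as np

def reduce_indices(ti, min_eigenvalue=1.0):
    """Return PC scores, retained eigenvalues and their share of the variance."""
    x = np.log(np.asarray(ti, dtype=float) + 1.0)      # log(TI + 1): defined even when an index is zero
    z = (x - x.mean(axis=0)) / x.std(axis=0, ddof=1)   # standardise, so PCA runs on the correlation matrix
    corr = np.corrcoef(x, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(corr)            # eigh returns eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1]                  # re-sort so PC1 has the largest eigenvalue
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    keep = eigvals > min_eigenvalue                    # assumed retention rule
    share = eigvals / eigvals.sum()                    # eigenvalues sum to the number of indices
    return z @ eigvecs[:, keep], eigvals[keep], share[keep]

# Hypothetical toy data: 30 "compounds" described by 4 made-up non-negative indices.
rng = np.random.default_rng(0)
a, b = rng.random(30), rng.random(30)
toy_ti = np.column_stack([10 * a, 100 * a + rng.random(30),
                          5 * b, 50 * b + rng.random(30)])
scores, eigenvalues, variance_share = reduce_indices(toy_ti)
print(scores.shape, eigenvalues, variance_share.cumsum())
```

Because the eigenvalues of a correlation matrix sum to the number of variables, an eigenvalue of 31.9 among 59 indices corresponds to 31.9/59, or about 54% of the variance, which is the figure listed for the first PC in Table III.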


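Quantification of Intermolecular Similarity

Intermolecular similarity was measured by the Euclidean distance (ED)
within an n-dimensional space [...] consisted of orthogonal variables
(PCs) derived from the TIs. ED between the molecules i and j is defined as:

$$ED_{ij} = \left[ \sum_{k=1}^{n} \left( D_{ik} - D_{jk} \right)^{2} \right]^{1/2}$$

where n is the number of dimensions and $D_{ik}$ and $D_{jk}$ are the data
values of the k-th dimension for molecules i and j [...].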
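
As a small worked example of the distance measure, the sketch below computes the Euclidean distance between two molecules represented by their coordinates in PC space. The function name and the three-dimensional coordinates are hypothetical; the study itself used eight PCs.

```python
import math

def euclidean_distance(pc_i, pc_j):
    """Euclidean distance between two molecules given their PC coordinates."""
    if len(pc_i) != len(pc_j):
        raise ValueError("both molecules need the same number of PC coordinates")
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(pc_i, pc_j)))

# Hypothetical coordinates on three PCs (made-up numbers).
mol_a = [1.0, 2.0, 0.5]
mol_b = [4.0, 6.0, 0.5]
print(euclidean_distance(mol_a, mol_b))  # 5.0, since sqrt(3**2 + 4**2 + 0**2) = 5
```

A smaller distance means the two structures are more similar with respect to the indices that feed the PCs; identical coordinates give a distance of zero.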

K-nearest  Neighbour  Selection  and  Boiling  Point  Estimation 
[...]
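
A K-nearest neighbour estimate of this kind can be sketched as follows. The description of the estimator is not legible in this copy, so the sketch assumes the common choice: the estimate for a probe compound is the unweighted mean of the observed values of its K nearest neighbours in PC space, with the probe itself excluded. The function name and the toy data are hypothetical.

```python
import math

def knn_estimate(scores, values, probe, k):
    """Mean property value of the k compounds closest to `probe` in PC space.

    Assumption: unweighted mean, probe excluded from its own neighbour list.
    """
    if not 1 <= k <= len(scores) - 1:
        raise ValueError("k must be between 1 and the number of other compounds")
    ranked = sorted(
        (math.dist(scores[probe], scores[j]), j)   # (distance, index) pairs
        for j in range(len(scores)) if j != probe
    )
    neighbours = [j for _, j in ranked[:k]]
    return sum(values[j] for j in neighbours) / k

# Hypothetical PC coordinates and boiling points for five compounds.
toy_scores = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (5.0, 5.0), (6.0, 5.0)]
toy_bp = [10.0, 20.0, 30.0, 100.0, 110.0]
print(knn_estimate(toy_scores, toy_bp, probe=0, k=2))  # 25.0
print(knn_estimate(toy_scores, toy_bp, probe=3, k=1))  # 110.0
```

Repeating the call for every compound and for each K of interest gives one estimated value per compound per K, which can then be compared with the observed values.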


[...] eigenvalues greater than one [...] cumulatively, 95.0% of the total
variance [...]. [...] eigenvalues of the eight PCs, the proportion of
[...] PC, and the cumulative variance explained. [...] most correlated
with each PC. The first PC is [...] correlated with the parameters that
characterize the size of the [...] complexity indices including SIC2 and
CIC2. For the third PC [...]

TABLE III

Summary of the principal components of 59 TIs for the 267 haloalkanes and the
correlation coefficients of the two TIs most correlated with each principal component

PC | Eigenvalue | Percent of variance | Cumulative percent | First correlated TI (r) | Second correlated TI (r)
1 | 31.9 | 54.0 | 54.0 | P0 (0.982) | P1 (0.982)
2 | 8.7 | 14.8 | 68.8 | SIC2 (0.949) | CIC2 (-0.922)
3 | 5.2 | 8.8 | 77.6 | [illegible] (-0.668) | [illegible] (-0.637)
4 | 3.6 | 6.1 | 83.7 | ^1χ^v (0.475) | ^3χ^v (0.440)
5 | 2.1 | 3.6 | 87.3 | [illegible] (0.495) | ^4χ^v (0.482)
6 | 1.9 | 3.2 | 90.5 | P5 (?) (0.579) | [illegible] (0.574)
7 | 1.5 | 2.5 | 93.0 | ^2χ^v (?) (0.282) | [illegible] (0.280)
8 | 1.2 | 2.0 | 95.0 | [illegible] (0.324) | [illegible] (-0.313)


correlations occur with the valence cluster connectivity TIs such as [...] and
[...]. The fourth PC was characterized by lower order valence path
connectivity indices such as [...] and [...] and the fifth PC by the higher
order valence path connectivity indices such as [...] and [...].
Interpretation beyond the fifth level PC becomes more difficult, as can be
seen in Table III. These PC-TI correlations agree with our expectations
based on previous research. Generally, PCs and TIs correlate as follows:
PC1 with the size of the molecular graph, PC2 with higher order complexity
indices, PC3 with cluster and path-cluster connectivity, and PC4 with low
order information theoretic indices.
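
The kind of summary given in Table III, the two indices most correlated with each PC, can be produced with a short routine like the one below. It is a sketch only: the function name, the index labels and the toy numbers are hypothetical, and ranking by absolute correlation is an assumption about how "most correlated" was decided.

```python
import numpy as np

def top_correlated(ti, pc_scores, names, n_top=2):
    """For each PC, list the n_top indices with the largest |Pearson r|."""
    ti = np.asarray(ti, dtype=float)
    pc = np.asarray(pc_scores, dtype=float)
    summary = []
    for k in range(pc.shape[1]):
        r = np.array([np.corrcoef(ti[:, j], pc[:, k])[0, 1]
                      for j in range(ti.shape[1])])
        best = np.argsort(-np.abs(r))[:n_top]      # strongest correlations first, sign kept
        summary.append([(names[j], float(r[j])) for j in best])
    return summary

# Hypothetical data: three made-up indices and one PC that tracks the first index.
toy_ti = np.array([[1.0, 5.0, 2.0],
                   [2.0, 4.0, 1.0],
                   [3.0, 3.0, 4.0],
                   [4.0, 2.0, 3.0],
                   [5.0, 1.5, 5.0]])
toy_pc = toy_ti[:, [0]]
result = top_correlated(toy_ti, toy_pc, ["TI_a", "TI_b", "TI_c"])
print(result)
```

Keeping the sign of r, as Table III does, shows whether a PC increases or decreases with the index in question.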


TABLE IV

Summary of the K-nearest neighbour normal boiling point estimation for
267 chlorofluorohydrocarbons

K | r | s.e. (°C)
1 | 0.854 | 33.2
2 | 0.908 | 26.4
3 | 0.923 | 24.5
4 | 0.927 | 24.2
5 | 0.933 | 23.7
6 | 0.934 | 24.3
7 | 0.934 | 24.3
8 | 0.936 | 24.3
9 | 0.939 | 24.4
10 | 0.939 | 24.7
15 | 0.936 | 26.2
20 | 0.936 | 27.7
25 | 0.943 | 28.0

Table IV reports the correlation and standard errors of boiling point
estimates obtained by the K-nearest neighbour estimation with the observed
boiling point values. Each line of the table represents a different K level.
The standard error for estimation was at its minimum of 23.7 °C for K = 5.
The correlation, however, continued an upward trend as K increased.
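
The two quality measures in Table IV can pull in different directions, so it helps to look at them side by side. The sketch below defines both measures and then picks the best K under each criterion using the values copied from Table IV. The helper names are hypothetical, and the standard error is taken here as the root-mean-square residual, which is an assumption because the exact formula is not given in this section.

```python
import math

def pearson_r(observed, estimated):
    """Pearson correlation between observed and estimated values."""
    n = len(observed)
    mo, me = sum(observed) / n, sum(estimated) / n
    cov = sum((o - mo) * (e - me) for o, e in zip(observed, estimated))
    so = math.sqrt(sum((o - mo) ** 2 for o in observed))
    se = math.sqrt(sum((e - me) ** 2 for e in estimated))
    return cov / (so * se)

def standard_error(observed, estimated):
    """Root-mean-square residual (assumed definition of the s.e.)."""
    n = len(observed)
    return math.sqrt(sum((o - e) ** 2 for o, e in zip(observed, estimated)) / n)

# K -> (r, s.e. in deg C), copied from Table IV.
table_iv = {1: (0.854, 33.2), 2: (0.908, 26.4), 3: (0.923, 24.5), 4: (0.927, 24.2),
            5: (0.933, 23.7), 6: (0.934, 24.3), 7: (0.934, 24.3), 8: (0.936, 24.3),
            9: (0.939, 24.4), 10: (0.939, 24.7), 15: (0.936, 26.2), 20: (0.936, 27.7),
            25: (0.943, 28.0)}

best_by_se = min(table_iv, key=lambda k: table_iv[k][1])   # smallest standard error
best_by_r = max(table_iv, key=lambda k: table_iv[k][0])    # largest correlation
print(best_by_se, best_by_r)  # 5 25
```

The standard error favours K = 5 while the correlation favours K = 25, which illustrates why the error of estimation, not the correlation alone, is the more useful guide when choosing K.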

DISCUSSION

The goal of this paper was to investigate the usefulness of general
similarity methods based on graph invariants in the prediction of the boiling
points of a set of 267 chlorofluorocarbons. To this end, we used Euclidean
distance in an eight-dimensional PC-space as the measure of structural
similarity/dissimilarity of CFCs. The results in Table IV show that the best
estimates of the normal boiling point are obtained at K = 5. Our previous
studies on similarity-based prediction of properties like lipophilicity,
boiling point, and mutagenicity have shown that a small number of
neighbours (K = 5-10) will usually give the best results in property
estimation.

Comparison of the K-nearest neighbour estimates reported in this paper
with previous studies on the same set of CFCs shows that similarity-based
estimates are inferior to predictions derived by neural net models. In the
neural net model, parametrization was done with an eye to specific
structural features of CFCs. In contrast, the PC-based similarity approach used
a set of general structural parameters which quantify such structural
features of chemical graphs as size, shape, degree of branching, etc. Yet,
similarity methods based on such graph theoretic parameters give a reasonably
good estimate of the normal boiling point of the CFCs analyzed in this paper.
The usefulness of the similarity approach depends on the context, i.e., what
level of accuracy is required.

In risk assessment, molecular similarity is used in the selection of
analogs of chemicals for hazard estimation. Very often, one has to do rapid
estimation of a large number of properties. Such estimations should be based
on parameters that can be algorithmically derived, i.e., can be computed for
any chemical species directly from structure. The graph invariants used in
this paper fall into this category. The results reported here show that such
methods can be used as a first order estimation of properties.

The parameters used in this paper did not include any stereoelectronic property
that might influence the normal boiling points of CFCs. It would be interesting
to see whether similarity methods give better estimates of boiling points when
stereoelectronic variables are included in the set of parameters. Such studies
are in progress and will be reported subsequently.
ABSTRACT

Estimation of Normal Boiling Points of Haloalkanes Using Molecular Similarity

Subhash C. Basak, Brian D. Gute and Gregory D. Grunwald

Molecular similarity was used to estimate the normal boiling points of a set of
276 haloalkanes with 1 to 4 carbon atoms. Molecular similarity/dissimilarity was
quantified by the Euclidean distance between molecules in an eight-dimensional
space of principal components derived from 59 topological indices. The
correlation coefficient between experimental and estimated boiling points ranges
from 0.854 to 0.943 for K-nearest neighbour estimates of the boiling points,
with different numbers of nearest neighbours (K = 1, ..., 10, 15,
Abstract


Three similarity spaces were used in the selection of analogs and K-nearest
neighbor (KNN) based estimation of normal boiling points for a diverse set of 2926
chemicals. The similarity spaces consisted of principal components (PCs) derived from:
1) 40 topostructural indices, 2) 61 topochemical parameters and 3) the full set of 101
topostructural and topochemical indices. The three methods selected sets of analogs
with a substantial number of structurally analogous molecules. For the KNN method of
property estimation, the similarity space which used the full set of indices was superior
to either of the subsets (topostructural or topochemical). For all three methods, K = 6-10
gave the best estimated values for boiling point.


1. Introduction


Interest in quantifying the similarity of molecules using computational methods
has increased [1-8]. In particular, a recent trend in the characterization of similarity/
dissimilarity of chemicals makes use of graph invariants. Molecular structures can be
represented by planar graphs, G = [V, E], where the nonempty set V represents the set
of atoms and the set E generally represents covalent bonds [9]. These graphs can be
used to adequately represent the pattern of connectedness of atoms within a molecule.
Graph invariants, values derived from planar graphs, are graph theoretic properties
which are identical for isomorphic graphs. A numerical graph invariant or topological
index maps a chemical structure into the set of real numbers.

Various  graph  invariants  have  been  used  in  ordering  and  partial  ordering  of  sets 
of  molecules  [1,  4-8].  Various  topological  indices  (TIs)  and  principal  components  (PCs) 
derived  from  TIs  have  been  used  in  quantifying  the  similarity/dissimilarity  of  molecules 
and in the similarity based estimation of physical and toxicological properties
[4, 5, 10-17]. Such TIs include those derived from simple planar graphs which contain adjacency
and  distance  information  for  vertices.  These  TIs  could  be  considered  topostructural 
indices.  Other  TIs,  which  are  derived  from  weighted  chemical  graphs,  could  be  called 
topochemical  indices  because  they  contain  explicit  information  regarding  the  chemical 
nature  of  the  atoms  (vertices)  and  bonds  (edges)  in  the  molecular  structure,  in  addition 
to  quantifying  the  adjacency  and  distance  relationships  within  the  graph. 

Our earlier studies made use of a combination of topostructural and
topochemical indices to select analogs of chemicals and estimate properties of
molecules in large and diverse databases using the K-nearest neighbor (KNN) method.
In this paper we have carried out a comparative analysis of similarity based analog
selection and KNN based estimation of normal boiling point using: a) a set of 40
topostructural indices, b) a group of 61 topochemical indices, and c) the combined set
of 101 indices.

2.  Methods 

2.1  DATABASE 

The normal boiling point database consisted of 2926 compounds taken from the
U.S. EPA ASTER [18] system. These data comprised a set for which chemical structures
and normal boiling point values were available, and for which it was possible to compute
all 101 TIs.

2.2  CALCULATION  OF  INDICES 

The  TIs  calculated  for  this  study  are  listed  in  table  1  and  include  Wiener  number 
[19], molecular connectivity indices as calculated by Randić [20] and Kier and Hall [21],
frequency  of  path  lengths  of  varying  size,  information  theoretic  indices  defined  on 
distance  matrices  of  graphs  using  the  methods  of  Bonchev  and  Trinajstic  [22]  as  well  as 
those  of  Raychaudhury  et  al.  [23],  parameters  defined  on  the  neighborhood  complexity 
of  vertices  in  hydrogen-filled  molecular  graphs  [24-26],  and  Balaban's  J  indices  [27-29]. 
The  majority  of  the  TIs  were  calculated  using  POLLY  2.3  [30].  The  J  indices  were 
calculated  using  software  developed  by  the  authors. 
The Wiener index (W), the first topological index reported in the chemical
literature [19], may be calculated from the distance matrix D(G) of a hydrogen-suppressed
chemical graph G as the sum of the entries in the upper triangular distance
submatrix. The distance matrix D(G) of a nondirected graph G with n vertices is a
symmetric n x n matrix $(d_{ij})$, where $d_{ij}$ is equal to the distance between
vertices $v_i$ and $v_j$ in G. Each diagonal element $d_{ii}$ of D(G) is zero. We give
below the distance matrix $D(G_1)$ of the unlabeled hydrogen-suppressed graph $G_1$ of
n-propanol (figure 1):

$$
D(G_1) =
\begin{array}{c|cccc}
    & (1) & (2) & (3) & (4) \\ \hline
(1) & 0 & 1 & 2 & 3 \\
(2) & 1 & 0 & 1 & 2 \\
(3) & 2 & 1 & 0 & 1 \\
(4) & 3 & 2 & 1 & 0
\end{array}
$$

W is calculated as:

$$
W = \frac{1}{2} \sum_{i,j} d_{ij} = \sum_{h} h \cdot g_h \tag{1}
$$

where $g_h$ is the number of unordered pairs of vertices whose distance is h. Thus for
$D(G_1)$, W has a value of ten.
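As a quick check of eq. (1), the following Python sketch computes W for the n-propanol distance matrix in two ways: as the sum of the upper-triangular entries, and from the counts of vertex pairs at each distance h. The function names `wiener_index` and `distance_counts` are illustrative choices, not part of the software used in the study.

```python
from collections import Counter

# Distance matrix of the hydrogen-suppressed graph of n-propanol
# (a chain of four non-hydrogen atoms), transcribed from the text.
D = [
    [0, 1, 2, 3],
    [1, 0, 1, 2],
    [2, 1, 0, 1],
    [3, 2, 1, 0],
]

def wiener_index(dist):
    """Sum of the upper-triangular entries of a symmetric distance matrix."""
    n = len(dist)
    return sum(dist[i][j] for i in range(n) for j in range(i + 1, n))

def distance_counts(dist):
    """Number of unordered vertex pairs at each distance h."""
    n = len(dist)
    return Counter(dist[i][j] for i in range(n) for j in range(i + 1, n))

W = wiener_index(D)
g = distance_counts(D)
# Second form of eq. (1): sum over h of h times the pair count at distance h.
W_from_counts = sum(h * g_h for h, g_h in g.items())
print(W, W_from_counts)  # both forms give 10 for n-propanol
```

Both routes give the value of ten quoted in the text, with three pairs at distance 1, two at distance 2 and one at distance 3.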


[Insert  Fig.  1  here] 

Randić's connectivity index [20], and higher-order connectivity path, cluster,
path-cluster and chain types of simple, bond and valence connectivity parameters were
calculated using the method of Kier and Hall [21]. The generalized form of the simple
path connectivity index is as follows:

$$
{}^{h}\chi = \sum_{\text{paths}} \left( v_i v_j \cdots v_{h+1} \right)^{-1/2} \tag{2}
$$

where $v_i, v_j, \ldots, v_{h+1}$ are the degrees of the vertices in the path of length h.
The path length parameters ($P_h$), the number of paths of length h (h = 0, 1, ..., 10) in
the hydrogen-suppressed graph, are calculated using standard algorithms.
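The path-based quantities described above can be illustrated with a small Python sketch. It enumerates the simple paths of length h in a hydrogen-suppressed graph, counts them, and sums the inverse square root of the product of vertex degrees along each path. The adjacency-list representation and the helper names are assumptions made for this example; it is not the POLLY implementation.

```python
import math

def simple_paths(adj, h):
    """Yield each simple path with h edges exactly once, as a tuple of vertices."""
    def extend(path):
        if len(path) == h + 1:
            # A path and its reverse are the same path; keep one orientation.
            if path[0] <= path[-1]:
                yield tuple(path)
            return
        for nb in adj[path[-1]]:
            if nb not in path:
                yield from extend(path + [nb])
    for v in adj:
        yield from extend([v])

def path_count(adj, h):
    """Number of paths of length h."""
    return sum(1 for _ in simple_paths(adj, h))

def path_connectivity(adj, h):
    """Simple path connectivity index of order h."""
    deg = {v: len(nbrs) for v, nbrs in adj.items()}
    return sum(math.prod(deg[v] for v in p) ** -0.5 for p in simple_paths(adj, h))

# Hydrogen-suppressed graph of n-propanol: a chain of four vertices.
propanol = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}

for h in range(4):
    print(h, path_count(propanol, h), round(path_connectivity(propanol, h), 4))
```

For the four-vertex chain the first-order index is 1/sqrt(2) + 1/2 + 1/sqrt(2), about 1.914. Cluster, path-cluster and chain subgraphs need a different enumeration and are not covered by this sketch.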

Information-theoretic topological indices are calculated by the application of
information theory on chemical graphs. An appropriate set A of n elements is derived
from a molecular graph G depending upon certain structural characteristics. On the
basis of an equivalence relation defined on A, the set A is partitioned into disjoint
subsets $A_i$ of order $n_i$ ($i = 1, 2, \ldots, h$; $\sum_i n_i = n$). A probability
distribution is then assigned to the set of equivalence classes:

$$
\begin{array}{l}
A_1, A_2, \ldots, A_h \\
p_1, p_2, \ldots, p_h
\end{array}
$$

where $p_i = n_i / n$ is the probability that a randomly selected element of A will occur
in the ith subset.

The mean information content of an element of A is defined by Shannon's
relation [31]:

$$
IC = -\sum_{i=1}^{h} p_i \log_2 p_i \tag{3}
$$

The logarithm is taken at base 2 for measuring the information content in bits. The total
information content of the set A is then n x IC.
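A minimal Python sketch of Shannon's relation follows. It takes the sizes of the equivalence classes as input and returns the mean and total information content. The function names and the example partitions are illustrative only.

```python
import math

def mean_information_content(class_sizes):
    """Shannon information (bits per element) of a partition given its class sizes."""
    n = sum(class_sizes)
    return -sum((n_i / n) * math.log2(n_i / n) for n_i in class_sizes)

def total_information_content(class_sizes):
    """Total information content: number of elements times the mean content."""
    return sum(class_sizes) * mean_information_content(class_sizes)

# Four elements split into two equal classes: one bit per element.
print(mean_information_content([2, 2]), total_information_content([2, 2]))
# Four elements each in its own class: log2(4) = 2 bits per element.
print(mean_information_content([1, 1, 1, 1]))
```

A partition with a single class carries no information, and a partition in which every element is its own class reaches the maximum, log2 n.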

To account for the chemical nature of vertices as well as their bonding pattern,
Sarkar et al. [32] calculated information content of chemical graphs on the basis of an
equivalence relation where two atoms of the same element are considered equivalent if
they possess an identical first-order topological neighborhood. Since properties of
atoms or reaction centers are often modulated by stereo-electronic characteristics of
distant neighbors, i.e., neighbors of neighbors, it was deemed essential to extend this
approach to account for higher-order neighbors of vertices. This can be accomplished
by defining open spheres for all vertices of a chemical graph. If r is any non-negative
real number and v is a vertex of the graph G, then the open sphere S(v, r) is defined as
the set consisting of all vertices $v_j$ in G such that $d(v, v_j) < r$. Therefore,
S(v, 0) = ∅, S(v, r) = v for 0 < r < 1, and S(v, r) is the set consisting of v and all
vertices $v_j$ of G situated at unit distance from v, if 1 < r < 2.

One can construct such open spheres for higher integral values of r. For a
particular value of r, the collection of all such open spheres S(v, r), where v runs over
the whole vertex set V, forms a neighborhood system of the vertices of G. A suitably
defined equivalence relation can then partition V into disjoint subsets consisting of
vertices which are topologically equivalent for rth order neighborhood. Such an
approach has been developed and the information-theoretic indices calculated based
on this idea are called indices of neighborhood symmetry [26].

In this method, chemicals are symbolized by weighted linear graphs. Two
vertices $u_0$ and $v_0$ of a molecular graph are said to be equivalent with respect to
rth order neighborhood if and only if corresponding to each path $u_0, u_1, \ldots, u_r$ of
length r, there is a distinct path $v_0, v_1, \ldots, v_r$ of the same length such that the
paths have similar edge weights, and both $u_0$ and $v_0$ are connected to the same number
and type of atoms up to the rth order bonded neighbors. The detailed equivalence
relation has been described in earlier studies [26, 33].

Once partitioning of the vertex set for a particular order of neighborhood is
completed, $IC_r$ is calculated by eq. (3). Basak et al. defined another
information-theoretic measure, structural information content ($SIC_r$), which is
calculated as:

$$
SIC_r = IC_r / \log_2 n \tag{4}
$$

where $IC_r$ is calculated from eq. (3) and n is the total number of vertices of the graph
[24].

Another information-theoretic invariant, complementary information content
($CIC_r$), is defined as:

$$
CIC_r = \log_2 n - IC_r \tag{5}
$$

$CIC_r$ represents the difference between maximum possible complexity of a graph
(where each vertex belongs to a separate equivalence class) and the realized
topological information of a chemical species as defined by $IC_r$ [25].
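The three information-theoretic indices can be computed together once the vertex partition is known. The Python sketch below uses, purely as an illustration, a zeroth-order partition of the hydrogen-filled graph of n-propanol in which atoms are grouped by element only (3 C, 8 H, 1 O). That choice of partition is an assumption for the example; the higher-order partitions used in the paper require the neighborhood equivalence relation described above.

```python
import math

def ic_sic_cic(class_sizes):
    """Return (IC, SIC, CIC) for a vertex partition given its class sizes."""
    n = sum(class_sizes)
    ic = -sum((n_i / n) * math.log2(n_i / n) for n_i in class_sizes)
    sic = ic / math.log2(n)   # structural information content
    cic = math.log2(n) - ic   # complementary information content
    return ic, sic, cic

# Illustrative partition: atoms of n-propanol (C3H8O) grouped by element.
ic0, sic0, cic0 = ic_sic_cic([3, 8, 1])
print(round(ic0, 4), round(sic0, 4), round(cic0, 4))
```

By construction SIC lies between 0 and 1, and IC and CIC always add up to log2 n, the value IC would take if every vertex formed its own class.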

In figure 2, the calculation of $IC_2$, $SIC_2$ and $CIC_2$ is demonstrated for the labeled
hydrogen-filled graph ($G_2$) of n-propanol.

[Insert  Fig.  2  here  ] 

The information-theoretic index on graph distance, $I_D^W$, is calculated from the distance
matrix D(G) of a chemical graph G as follows [22]:

$$
I_D^W = W \log_2 W - \sum_{h} g_h \cdot h \log_2 h \tag{6}
$$

The mean information index, $\bar{I}_D^W$, is found by dividing the information index
$I_D^W$ by W. The information theoretic parameters defined on the distance matrix, $H^D$
and $H^V$, were calculated by the method of Raychaudhury et al. [23].
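The distance-based information index and its mean can be sketched in Python as follows, reusing the n-propanol distance matrix given earlier in the text. The function name `distance_information` is a label chosen for this example.

```python
import math
from collections import Counter

def distance_information(dist):
    """Return (W, information index on distances, mean information index)."""
    n = len(dist)
    # g[h] is the number of unordered vertex pairs at distance h.
    g = Counter(dist[i][j] for i in range(n) for j in range(i + 1, n))
    w = sum(h * g_h for h, g_h in g.items())
    info = w * math.log2(w) - sum(g_h * h * math.log2(h) for h, g_h in g.items())
    return w, info, info / w

# Hydrogen-suppressed graph of n-propanol (chain of four vertices).
D = [
    [0, 1, 2, 3],
    [1, 0, 1, 2],
    [2, 1, 0, 1],
    [3, 2, 1, 0],
]
W, i_dw, mean_i_dw = distance_information(D)
print(W, round(i_dw, 3), round(mean_i_dw, 3))
```

For this graph W = 10, and pairs at distance 1 contribute nothing to the subtracted sum because log2 1 = 0.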

Balaban defined a series of indices based upon distance sums within the
distance matrix for a chemical graph which he designated as J indices [27-29]. These
indices are highly discriminating with low degeneracy. Unlike W, the range of values of
the J indices is independent of molecular size. The general form of the J index
calculation is as follows:

$$
J = q (\mu + 1)^{-1} \sum_{i,j \in \text{edges}} (s_i s_j)^{-1/2} \tag{7}
$$

where the cyclomatic number $\mu$ (or number of rings in the graph) is $\mu = q - n + 1$,
with q edges and n vertices, $s_i$ is the sum of the distances of atom i to all other
atoms, and $s_j$ is the sum of the distances of atom j to all other atoms [27]. Variants
were proposed by Balaban for incorporating information on bond type, relative
electronegativities, and relative covalent radii [28-29].
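A short Python sketch of the distance-based J index is given below. It assumes a connected, unweighted graph stored as an adjacency list and obtains the distance sums by breadth-first search. It does not cover the bond-type, electronegativity or covalent-radius variants, and the function names are illustrative rather than those of the authors' software.

```python
from collections import deque

def distance_sums(adj):
    """Sum of shortest-path distances from each vertex to all other vertices."""
    sums = {}
    for start in adj:
        dist = {start: 0}
        queue = deque([start])
        while queue:
            u = queue.popleft()
            for w in adj[u]:
                if w not in dist:
                    dist[w] = dist[u] + 1
                    queue.append(w)
        sums[start] = sum(dist.values())
    return sums

def balaban_j(adj):
    """Distance-based Balaban J index of a connected graph."""
    n = len(adj)
    q = sum(len(nbrs) for nbrs in adj.values()) // 2  # number of edges
    mu = q - n + 1                                    # cyclomatic number
    s = distance_sums(adj)
    # Visit each undirected edge once (u < w).
    edge_sum = sum((s[u] * s[w]) ** -0.5 for u in adj for w in adj[u] if u < w)
    return q / (mu + 1) * edge_sum

chain4 = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}     # skeleton of n-propanol
triangle = {0: [1, 2], 1: [0, 2], 2: [0, 1]}        # a three-membered ring
print(round(balaban_j(chain4), 4), balaban_j(triangle))
```

For the four-vertex chain the distance sums are 6, 4, 4 and 6, which gives J of about 1.9747. The three-membered ring has one cycle, so the sum over edges is scaled by q/2.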

2.3  CLASSIFICATION  OF  THE  INDICES 

The  set  of  101  TIs  was  partitioned  into  two  distinct  subsets:  topostructural 
indices  and  topochemical  indices.  Topostructural  indices  encode  information  about  the 
adjacency  and  distances  of  atoms  (vertices)  in  molecular  structures  (graphs) 
irrespective  of  atom  type  or  factors  such  as  hybridization  states  and  number  of  core/ 
valence  electrons  in  individual  atoms.  Topochemical  indices  quantify  information 
regarding  specific  chemical  properties  of  the  atoms  comprising  a  molecule  as  well  as 
the  topology  (connectivity  of  atoms).  Topochemical  indices  are  derived  from  weighted 
molecular  graphs  where  each  vertex  (atom)  is  properly  weighted  with  selected 
chemical/physical  properties.  These  subsets  are  shown  in  table  1. 

2.4  STATISTICAL  METHODS  AND  COMPUTATION  OF  SIMILARITY 
Data  Reduction 

Initially, all TIs were transformed by the natural logarithm of the index plus one.
This was done since the scale of some TIs may be several orders of magnitude greater
than that of other TIs.

A principal component analysis (PCA) was used on the transformed indices to
minimize intercorrelation of indices. The PCA was accomplished using the
SAS procedure PRINCOMP [34]. The PCA produces linear combinations of the TIs,
called principal components (PCs), which are derived from the correlation matrix. The
first PC has the largest variance, or eigenvalue, of the linear combination of TIs. Each
subsequent PC explains the maximal index variance orthogonal to the previous PCs,
eliminating any redundancies which could occur within the set of TIs. The maximum
number of PCs generated is equal to the number of TIs available. For the purposes of
this study, only PCs with eigenvalues greater than one were retained. A more detailed
explanation of this approach has been provided in a previous study by Basak et al. [4].
These PCs were subsequently used in determining similarity scores as described
below.
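The data-reduction steps described above (log transform, PCA on the correlation matrix, retention of PCs with eigenvalues greater than one) can be sketched in plain Python. This is only an illustration under stated assumptions: the study used SAS PRINCOMP, whereas the sketch swaps in a cyclic Jacobi eigenvalue routine, and the index table is synthetic random data, not the TIs of the paper. All function names are hypothetical.

```python
import math
import random

def standardize(table):
    """Centre each column and scale it to unit sample standard deviation."""
    n, m = len(table), len(table[0])
    means = [sum(row[j] for row in table) / n for j in range(m)]
    sds = [math.sqrt(sum((row[j] - means[j]) ** 2 for row in table) / (n - 1))
           for j in range(m)]
    return [[(row[j] - means[j]) / sds[j] for j in range(m)] for row in table]

def correlation_matrix(z):
    """Correlation matrix of already standardized columns."""
    n, m = len(z), len(z[0])
    return [[sum(r[a] * r[b] for r in z) / (n - 1) for b in range(m)] for a in range(m)]

def jacobi_eigen(sym, sweeps=100, tol=1e-22):
    """Eigenvalues and eigenvectors (as columns) of a symmetric matrix."""
    n = len(sym)
    a = [row[:] for row in sym]
    v = [[float(i == j) for j in range(n)] for i in range(n)]
    for _ in range(sweeps):
        if sum(a[i][j] ** 2 for i in range(n) for j in range(n) if i != j) < tol:
            break
        for p in range(n - 1):
            for q in range(p + 1, n):
                if a[p][q] == 0.0:
                    continue
                diff = a[q][q] - a[p][p]
                theta = math.pi / 4 if diff == 0 else 0.5 * math.atan(2 * a[p][q] / diff)
                c, s = math.cos(theta), math.sin(theta)
                for k in range(n):   # rotate columns p and q of a and v
                    a[k][p], a[k][q] = c * a[k][p] - s * a[k][q], s * a[k][p] + c * a[k][q]
                    v[k][p], v[k][q] = c * v[k][p] - s * v[k][q], s * v[k][p] + c * v[k][q]
                for k in range(n):   # rotate rows p and q of a
                    a[p][k], a[q][k] = c * a[p][k] - s * a[q][k], s * a[p][k] + c * a[q][k]
    return [a[i][i] for i in range(n)], v

def principal_components(table, min_eigenvalue=1.0):
    """Log-transform, run PCA on the correlation matrix, keep PCs above the cutoff."""
    z = standardize([[math.log(x + 1.0) for x in row] for row in table])
    eigvals, eigvecs = jacobi_eigen(correlation_matrix(z))
    kept = sorted((k for k in range(len(eigvals)) if eigvals[k] > min_eigenvalue),
                  key=lambda k: -eigvals[k])
    scores = [[sum(row[j] * eigvecs[j][k] for j in range(len(row))) for k in kept]
              for row in z]
    return [eigvals[k] for k in kept], scores

# Synthetic table: three size-like and two shape-like columns on different scales.
rng = random.Random(0)
table = []
for _ in range(40):
    size, shape = rng.uniform(1, 10), rng.uniform(0, 3)
    table.append([size, size ** 2 + rng.random(), 100 * size + rng.random(),
                  shape, shape + 0.1 * rng.random()])
eigenvalues, scores = principal_components(table)
print(len(eigenvalues), [round(ev, 3) for ev in eigenvalues])
```

Each retained score column has sample variance equal to its eigenvalue, and the eigenvalues of a correlation matrix sum to the number of indices, which is why a cutoff of one keeps only PCs that explain more variance than a single standardized index.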

Similarity  Measures 

Intermolecular similarity was measured by the Euclidean distance (ED) within an
n-dimensional space. This n-dimensional space consisted of orthogonal variables (PCs)
derived from the TIs as described above. ED between the molecules i and j is defined
as:

$$
ED_{ij} = \left[ \sum_{k=1}^{n} \left( D_{ik} - D_{jk} \right)^2 \right]^{1/2} \tag{8}
$$

where n equals the number of dimensions or PCs retained from the PCA. $D_{ik}$ and
$D_{jk}$ are the data values of the kth dimension for chemicals i and j, respectively.
K-Nearest Neighbor Selection and Property Estimation

Following the quantification of intermolecular similarity of the 2926 chemicals, the
K-nearest neighbors (K = 1-10, 15, 20, 25) were determined on the basis of ED. This
procedure can be used to select structural analogs (neighbors) of a probe compound or
the neighbors can be used in property estimation. In estimating the normal boiling point
of the probe compound, the mean observed normal boiling point of the K-nearest
neighbors was used as the estimate and the standard error (s) of the estimate was
used to assess the efficacy of the set of indices.
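The neighbor selection and estimation step can be sketched as follows. The sketch assumes that the probe compound itself is excluded from its own neighbor list, which the text does not state explicitly, and it uses a small made-up table of PC scores and property values rather than data from the study. The names `euclidean_distance` and `knn_estimate` are illustrative.

```python
import math

def euclidean_distance(a, b):
    """Euclidean distance between two points in PC space."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def knn_estimate(scores, values, probe, k):
    """Mean property value of the k compounds nearest to `probe` in PC space.

    The probe is left out of its own neighbor list (assumed here).
    Returns the estimate and the indices of the selected neighbors.
    """
    ranked = sorted(
        (euclidean_distance(scores[probe], scores[j]), j)
        for j in range(len(scores)) if j != probe
    )
    neighbours = [j for _, j in ranked[:k]]
    return sum(values[j] for j in neighbours) / k, neighbours

# Hypothetical PC scores (two dimensions) and property values for five compounds.
pc_scores = [(0.0, 0.0), (1.0, 0.0), (0.0, 2.0), (5.0, 5.0), (6.0, 5.0)]
boiling_points = [100.0, 110.0, 120.0, 200.0, 210.0]

estimate, analogs = knn_estimate(pc_scores, boiling_points, probe=0, k=2)
print(estimate, analogs)
```

The same function serves both purposes named in the text: the returned index list is the set of selected analogs, and the returned mean is the property estimate. Repeating the call for every compound and comparing estimates with observed values gives the statistics used to compare similarity spaces.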

3.  Results 

3.1  PRINCIPAL  COMPONENT  ANALYSIS 

From the PCA of the 40 topostructural indices, seven PCs with eigenvalues
greater than one were retained. These seven PCs explained, cumulatively, 90.8% of
the total variance within the TI data. Table 2 lists the eigenvalues of the seven PCs, the
proportion of variance explained by each PC, the cumulative variance explained, and
the three TIs most correlated with each individual PC.

The PCA of the 61 topochemical indices resulted in the selection of ten PCs, all
having eigenvalues greater than one. The ten PCs explain a total of 92.1% of the
variance within the TI data. Table 3 presents a summary of the information regarding
these ten PCs.

Twelve PCs were retained from the PCA of the full set of 101 TIs. Each of these
PCs had an eigenvalue greater than one and, cumulatively, they explained 92.8% of
the variance within the full set of TIs. These PCs are summarized in table 4.

3.2  ANALOG  SELECTION 

Figure 3 shows an example of analog selection using PCs to derive a Euclidean
distance space. The first five analogs (neighbors) for the probe compound, 3-methyl-4-
chlorophenol, are presented for each of the three similarity spaces. The analogs
selected by the topostructural model show a repetition of the same skeletal structure,
ignoring substituents, throughout the first five analogs. In the topochemical model and
the full set model some variability in the skeletal structure arises (topochemical analogs
2 & 5, full set analog 4). Also of interest is the repetition of chemicals between the sets of
analogs. While the ordering varies between the methods, the topostructural and
topochemical models select two identical structures, the topostructural and the full set
have three analogs in common, and the topochemical and full set select four of the
same analogs. 2-Chloro-5-methylphenol appears in all three sets, while there are only
three unique compounds (topostructural analogs 4 & 5, topochemical analog 5).

[Insert  Fig.  3] 

3.3  K-NEAREST  NEIGHBOR  PROPERTY  ESTIMATION 

Figure 4 presents the correlation (r) and the standard error (s) of the prediction of
the normal boiling points for the 2926 chemicals for the three groups of indices over the
full range of K values examined (K = 1-10, 15, 20, 25). Table 5 shows the best normal
boiling point model for each set of indices. The best boiling point estimates for all three
sets were for K in the range of 6 to 10. The full set of indices gave the best result;
however, there was only a small difference between models.

[Insert  Fig.  4] 

4.  Discussion 

The purpose of this paper was to study the relative effectiveness of three
similarity  spaces  derived  from  graph  invariants  in  the  selection  of  structural  analogs  and 
in  the  KNN  based  estimation  of  properties.  The  similarity  spaces  were  created  using  a 
principal  component  analysis  of  calculated  graph  invariants.  Tables  2-4  summarize  the 
results  of  the  PCA  of  the  three  sets  of  indices.  The  first  PC  is  always  correlated  with 
indices  which  quantify  molecular  size.  In  the  case  of  the  topostructural  indices,  the 
second  PC  is  most  correlated  with  branching  indices.  In  the  case  of  PCs  derived  from 
either  topochemical  or  the  full  set  of  topostructural  and  topochemical  parameters,  the 
first  PC  was  strongly  correlated  with  molecular  size,  while  the  second  PC  was  highly 
associated  with  the  molecular  complexity  indices.  These  results  are  in  line  with  our 
earlier  studies  on  different  sets  of  chemicals  [4,  5, 11,  35,  36]. 

All three spaces were used in the selection of five analogs of a particular
structure (Figure 3). Perusal of the three sets of structures shows that there is a
substantial degree of similarity among the three groups of five chemicals selected. It is
interesting to note that all five nearest neighbors of the probe selected by the
topostructural method had isomorphic skeletal graphs when hydrogen atoms are
suppressed. For the two similarity spaces created by topochemical indices alone and
the combined set of topostructural and topochemical indices, four of the five selected
neighbors are common (Figure 3) although the ordering of the molecules is different.
This shows that these two similarity methods are not intrinsically very different. Our
earlier results showed that similarity methods derived from experimental physical
properties, atom pairs and topological indices select very similar sets of analogs [10].

In the case of KNN based estimation of boiling points of chemicals from their
analogs, K was varied from 1 to 25. The best estimated value was obtained in the
range of K = 6-10. This is in line with our earlier studies with different properties
[11, 12].

In  conclusion,  the  three  similarity  spaces  derived  in  this  paper  have  reasonable 
power  for  selecting  analogous  molecules  from  a  very  diverse  database  of  chemicals. 
The  KNN  based  estimation  shows  that  selected  analogs  can  be  used  for  the  estimation 
of  boiling  points  of  diverse  chemicals  if  more  accurate  methods  are  not  available. 
Figure Legend

Figure 1  The unlabeled hydrogen-suppressed graph ($G_1$) of n-propanol.

Figure 2  Calculation of the indices $IC_2$, $SIC_2$, and $CIC_2$ for the hydrogen-filled,
          labeled graph ($G_2$) of n-propanol.

Figure 3  The five analogs selected for the probe 3-methyl-4-chlorophenol using
          three molecular similarity spaces: topostructural, topochemical, and all
          indices. The numbers under the structures indicate the ranking of the
          analogs and the Euclidean distance to the probe.

Figure 4  Pattern of: (a) correlation (r) and (b) standard error (s) of the estimates
          according to the K-nearest neighbor selection for 2926 normal boiling
          points using three molecular similarity spaces.
Table 1. Symbols, definitions and classifications of topological parameters.

Topostructural

$I_D^W$            Information index for the magnitudes of distances between all possible
                   pairs of vertices of a graph
$\bar{I}_D^W$      Mean information index for the magnitude of distance
$W$                Wiener index = half-sum of the off-diagonal elements of the distance
                   matrix of a graph
$I^D$              Degree complexity
$H^V$              Graph vertex complexity
$H^D$              Graph distance complexity
$\overline{IC}$    Information content of the distance matrix partitioned by frequency of
                   occurrences of distance h
$O$                Order of neighborhood when $IC_r$ reaches its maximum value for the
                   hydrogen-filled graph
$M_1$              A Zagreb group parameter = sum of square of degree over all vertices
$M_2$              A Zagreb group parameter = sum of cross-product of degrees over all
                   neighboring (connected) vertices
${}^h\chi$         Path connectivity index of order h = 0-6
${}^h\chi_C$       Cluster connectivity index of order h = 3-6
${}^h\chi_{PC}$    Path-cluster connectivity index of order h = 4-6
${}^h\chi_{Ch}$    Chain connectivity index of order h = 3-6
$P_h$              Number of paths of length h = 0-10
$J$                Balaban's J index based on distance

Topochemical

$I_{ORB}$          Information content or complexity of the hydrogen-suppressed graph at
                   its maximum neighborhood of vertices
$IC_r$             Mean information content or complexity of a graph based on the rth
                   (r = 0-6) order neighborhood of vertices in a hydrogen-filled graph
$SIC_r$            Structural information content for rth (r = 0-6) order neighborhood of
                   vertices in a hydrogen-filled graph
$CIC_r$            Complementary information content for rth (r = 0-6) order neighborhood
                   of vertices in a hydrogen-filled graph
${}^h\chi^b$       Bond path connectivity index of order h = 0-6
${}^h\chi^b_C$     Bond cluster connectivity index of order h = 3-6
${}^h\chi^b_{Ch}$  Bond chain connectivity index of order h = 3-6
${}^h\chi^b_{PC}$  Bond path-cluster connectivity index of order h = 4-6
${}^h\chi^v$       Valence path connectivity index of order h = 0-6
${}^h\chi^v_C$     Valence cluster connectivity index of order h = 3-6
${}^h\chi^v_{Ch}$  Valence chain connectivity index of order h = 3-6
${}^h\chi^v_{PC}$  Valence path-cluster connectivity index of order h = 4-6
$J^B$              Balaban's J index based on bond types
$J^X$              Balaban's J index based on relative electronegativities
$J^Y$              Balaban's J index based on relative covalent radii


Table 2. Summary of principal component analysis of 40 topostructural indices for 2926

Table 3. Summary of principal component analysis of 61 topochemical indices for 2926

We have used topological, topochemical and geometrical parameters in predicting: (a) normal boiling point
of a set of 1023 chemicals and (b) lipophilicity (log P, octanol/water) of 219 chemicals. The results show
that topological and topochemical variables can explain most of the variance in the data. The addition of
geometrical parameters to the models provides marginal improvement in the model's predictive power. Among
the three classes of descriptors, the topochemical indices were the most effective in predicting properties.


1.  INTRODUCTION 

A contemporary trend in theoretical chemistry, biomedicinal
chemistry, drug design, and toxicology is the prediction
of relevant properties of chemicals using structure-activity
relationships (SARs). A large number of SARs
published in recent literature use parameters which can be
calculated directly from molecular structure, as opposed to
experimentally derived properties or parameters. The
principal motivating factor behind this trend is our need to
know many properties of a very large number of chemicals,
both for practical drug design and hazard assessment of
chemicals. All these properties cannot be determined
experimentally due to limited resources. The modeling of
the properties of chemicals using SARs based on calculated
molecular descriptors has the following three major
components:

1. Optimal representation of the chemical species by a
chosen model object (structure representation).

2. Enumeration of relevant characteristics of the model
object (parameterization).

3. Development of qualitative or quantitative models to
predict properties using the selected structural characteristics
(property prediction).

The first step in the overall process is representation
(Figure 1). The term molecular structure represents a set of
nonequivalent concepts. There is no reason to believe that
when discussing different topics, e.g., chemical synthesis,
reaction rates, spectroscopic transitions, reaction mechanisms,
and ab initio calculations, the term "molecular structure"
represents the same fundamental reality. In fact, the
various models of chemicals, e.g., classical valence bond
representation, different graph theoretic representations, ball
and spoke model of molecules, minimum energy conformation,
and symbolic representation of molecules by Hamiltonian
operators, are nothing but various representations of
the same chemical entity. Once the model object is chosen,
subsequent processes of parameterization and property
estimation can be done in more than one way. Consequently,
the field of theoretical SAR is comprised of a set of diverse
modeling activities.

♦  All  correspondence  to  be  addressed  to:  Dr.  Subhash  C.  Basak,  Natural 
Resources  Research  Institute,  University  of  Minnesota,  Duluth,  Duluth,  MN 
55811 
Figure 1. The processes of experimental determination vis-a-vis
theoretical prediction of properties from SARs. C represents the
set of chemicals and R the set of real numbers.

A convenient method of representing chemical species is
by means of molecular graphs, where atoms are represented
by vertices and bonds are depicted by edges. Invariants
derived from graphs can be used to characterize chemical
structure. When a molecule is represented by a simple planar
graph which does not distinguish among atoms or bond types,
such invariants quantify molecular topology without being
sensitive to such important chemical features as the presence
of heteroatoms or bond multiplicity. Such invariants may
be termed "topological". On the other hand, when molecules
are represented by graphs which are properly weighted to
represent heterogeneity of atom types and bonding pattern,
invariants derived from such graphs are chemically more
realistic. Such invariants have been found to be more
useful as compared to the topological indices. We call such
indices "topochemical" parameters, because they quantify
both the topology (connectivity) of atoms and the chemical
characteristics of the specific molecular structure.

Another set of descriptors which have been used in many
SARs are the geometrical or shape parameters, which encode
information about the spatial characteristics of atoms in the
molecule.36-38

In practical drug design and hazard assessment, where it
is necessary to carry out very rapid estimation of a large
number of properties with no or very little empirical input,
SARs based on topological, topochemical, and geometrical
parameters can be of practical use. Therefore, in this paper,
we have carried out a comparative study of topological,
topochemical, and geometrical parameters in estimating (a)
boiling point of a subset of the Toxic Substances Control
Act (TSCA) Inventory comprising 1023 molecules and (b)
lipophilicity of a set of 219 diverse compounds. The results
are presented here with an analysis of the relative contributions
of the three classes of indices in the development of
SAR models.

2.  MATERIALS  AND  METHODS 

2.1. Normal Boiling Point Database. We used a subset of the Toxic Substances Control Act (TSCA) Inventory for which measured normal boiling point values were available and where HB1 was equal to zero. HB1 is a measure of the hydrogen bonding potential of a chemical. There were 1023 chemicals in the TSCA Inventory which satisfied these two criteria. Because of the large number of chemicals in this study, we are not listing the data for these chemicals in this paper. An electronic copy of the data may be obtained by contacting the authors.

2.2. Log P Database. Measured values of log P were obtained from CLOGP, namely, the STARLIST group of chemicals. For this study, we used only chemicals where HB1 was equal to zero. Also, the range of log P values for the purpose of estimation was restricted to -2 to 5.5. Actual measurements for log P beyond this range have been shown to be problematic. Table 1 provides a listing of the 219 chemicals that met these conditions.

2.3. Calculation of Topological and Geometric Parameters. Most of the topological indices used for property estimation were calculated by the computer program POLLY. These indices include the molecular connectivity indices developed by Randić and Kier and Hall, Wiener number, and frequency of path lengths of varying size. Information theoretic indices defined on the hydrogen-filled and hydrogen-suppressed molecular graph were calculated by POLLY using the methods of Basak et al., Roy et al., Raychaudhury et al., and Bonchev and Trinajstić.

The J indices of Balaban were calculated using software developed by the authors. The hydrogen bonding parameter, HB1, was calculated using a program developed by Basak and is based on the ideas of Ou et al.

van der Waals volume ($V_W$) was calculated using Sybyl 6.2. The 3-D Wiener numbers were calculated using Sybyl with an SPL (Sybyl Programming Language) program developed by the authors. The calculation of the 3-D Wiener number consists of summing the entries in the upper triangular submatrix of the topographic Euclidean distance matrix for a molecule. The 3-D coordinates of each atom, needed for these computations, were determined using CONCORD 3.2.1. For this paper, two variants of the 3-D Wiener number have been calculated, ${}^{3D}W$ and ${}^{3D}W_H$, where the hydrogen atoms have been excluded and included in the calculation, respectively.
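The summation described above can be sketched in a few lines. This is only an illustration, not the authors' SPL program: the function names and the toy coordinates are hypothetical, and real coordinates would come from a 3-D structure generator such as the one named in the text.

```python
import math
from itertools import combinations


def wiener_3d(coords):
    """Sum the Euclidean distances over all unordered atom pairs.

    This equals the sum of the upper triangular entries of the
    Euclidean distance matrix.
    """
    return sum(math.dist(a, b) for a, b in combinations(coords, 2))


def wiener_3d_variants(atoms):
    """Return (hydrogen-suppressed, hydrogen-filled) 3-D Wiener numbers.

    `atoms` is a list of (element_symbol, (x, y, z)) tuples; this input
    layout is an assumption made for the sketch.
    """
    heavy = [xyz for element, xyz in atoms if element != "H"]
    everything = [xyz for _, xyz in atoms]
    return wiener_3d(heavy), wiener_3d(everything)


# Toy collinear "molecule" with made-up coordinates (not real geometry).
toy = [("C", (0.0, 0.0, 0.0)), ("C", (1.0, 0.0, 0.0)), ("H", (2.0, 0.0, 0.0))]
w_suppressed, w_filled = wiener_3d_variants(toy)
```

For the toy input the hydrogen-suppressed value is the single C-C distance, while the hydrogen-filled value adds the two distances involving the hydrogen.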

In  Table  2,  the  symbols  for  all  topological  and  geometric 
parameters  have  been  listed.  A  brief  definition  of  each 
parameter  is  provided  in  Table  2  as  well. 

The parameters in Table 2 were then classified as being topological, topochemical, or geometric. Table 2 is organized to show where each parameter was classed. The topological parameters consist of those indices in which atom specific information and bonding type are ignored in calculation of the index. The topochemical indices account for atom and bond type information. The geometric parameters are based upon 3-D coordinate information of the molecule.

2.4. Statistical Analyses. Since the difference in magnitude for the topological and topochemical indices can vary greatly, they were transformed by the natural logarithm of the index plus one. One was added since many of the indices can be zero. The geometric parameters were transformed by the natural logarithm of the parameter.
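The two transformations can be written down directly. The sketch below is illustrative only; the function names are hypothetical and the original work used its own software for this step.

```python
import math


def transform_index(value):
    """ln(index + 1): used for topological and topochemical indices,
    which may be zero."""
    return math.log(value + 1.0)


def transform_geometric(value):
    """ln(parameter): used for geometric parameters, assumed positive."""
    if value <= 0:
        raise ValueError("geometric parameters must be positive")
    return math.log(value)
```

Adding one before taking the logarithm keeps a zero-valued index at zero after transformation, which is why the shift is applied only to the index classes that can vanish.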

Two regression procedures were used in the development of models. When the number of independent variables was high, typically greater than 25, a stepwise regression procedure to maximize improvement to $R^2$ was used. When the number of independent variables was small, all possible subsets regression was used. All regression models were developed using procedure REG of the SAS statistical package.

For both data sets, we randomly split the chemicals into approximately equal (50%/50%) training and test sets. For the BP data, there were 512 chemicals in the training set and 511 chemicals in the test set. For log P, there were 114 chemicals in the training set and 105 chemicals in the test set. The training set and test set of chemicals are identified in Table 1 for the log P data. Models were developed using the training set of chemicals. These models were then used to predict the property values of the test chemicals. Final models were then developed using the combined training and test set of chemicals.
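A random assignment of this kind might look like the following sketch. It assumes each chemical is assigned independently with probability 0.5, which would explain set sizes that are close to, but not exactly, equal; the paper does not describe the mechanism, and the function name and seed are hypothetical.

```python
import random


def random_assign(items, p_train=0.5, seed=0):
    """Assign each item independently to the training or test set."""
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    train, test = [], []
    for item in items:
        (train if rng.random() < p_train else test).append(item)
    return train, test


chemical_ids = list(range(1, 220))  # 219 placeholder identifiers
train_ids, test_ids = random_assign(chemical_ids)
```

Every chemical lands in exactly one of the two sets, and rerunning with the same seed reproduces the same split.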

Initial models for the dependent property (BP or log P) were developed using only the topological class of indices. Once the best topological model was determined, the topological indices used in the model were added to the set of topochemical indices. Then the best model from this combined set of indices was determined. Finally, the topological and/or topochemical indices used in the best model so far were added to the set of geometric parameters, and the best model using all of these parameters was determined.
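The three-tier procedure above can be sketched as a small driver that carries the retained variables forward from one class to the next. The selection step itself (stepwise or all-subsets regression) is left as a caller-supplied function; `select_best`, the toy selector, and the descriptor names below are hypothetical stand-ins, not the SAS procedure used in the paper.

```python
def hierarchical_selection(topological, topochemical, geometric, select_best):
    """Run variable selection tier by tier.

    `select_best` takes a list of candidate descriptor names and returns
    the subset used by the best model found among them.
    """
    stage1 = select_best(list(topological))
    # Survivors of tier 1 compete alongside all topochemical indices.
    stage2 = select_best(stage1 + list(topochemical))
    # Survivors of tier 2 compete alongside the geometric parameters.
    stage3 = select_best(stage2 + list(geometric))
    return stage1, stage2, stage3


# Toy selector: keeps descriptors from a fixed "useful" set (illustration only).
useful = {"chi", "P10", "IC0", "3DWH"}
toy_selector = lambda names: [n for n in names if n in useful]

stages = hierarchical_selection(
    ["W", "chi", "P10"], ["IC0", "SIC1"], ["VW", "3DWH"], toy_selector
)
```

Because each tier only sees the survivors of the previous one plus one new class, the number of candidate variables offered to any single regression stays small.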

3.  RESULTS 

3.1. TSCA Boiling Point Estimation. Stepwise regression analyses for BP of the training set of chemicals are summarized in Table 3. As is shown in Table 3, the topological model using 11 parameters resulted in an explained variance ($R^2$) of 80.8% and standard error (s) of 40.9 °C. Addition of the topochemical parameters to the 11 topological parameters increased the effectiveness of the model significantly. The resulting model used nine parameters: two topological parameters and seven topochemical parameters. This model had an $R^2$ of 96.5% and s of 17.4 °C. All subsets regression of the nine topological and topochemical parameters retained thus far and the three geometric parameters resulted in a ten parameter model. This model included the nine topological and topochemical parameters and the geometric parameter ${}^{3D}W_H$. This model represented a slight improvement with $R^2$ of 96.7% and s of 16.8 °C.
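For readers who want to reproduce summary statistics of this kind, the following sketch computes explained variance and standard error from observed and fitted values. It assumes s is the root mean square error with n - p - 1 degrees of freedom, the usual convention for a least-squares model with an intercept; the paper does not spell out its formula, and the numbers below are made up.

```python
import math


def fit_statistics(observed, predicted, n_params):
    """Return (R^2, s) for a model with `n_params` independent variables."""
    n = len(observed)
    mean_obs = sum(observed) / n
    sse = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    sst = sum((o - mean_obs) ** 2 for o in observed)
    r_squared = 1.0 - sse / sst
    # Residual degrees of freedom: n observations, p slopes, 1 intercept.
    s = math.sqrt(sse / (n - n_params - 1))
    return r_squared, s


# Made-up observed and fitted values for a one-variable model.
r2, s = fit_statistics([1, 2, 3, 4, 5], [1.1, 1.9, 3.2, 3.8, 5.0], n_params=1)
```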

Application of the three models to the test set of chemicals resulted in comparable $R^2$ and s, which are listed in Table 3.
Table 1. Observed and Estimated Lipophilicity (Log P, Octanol/Water) for 219 Chemicals with HB1 Equal to Zero

no.  chemical name  obs log P  est log P (eq 4)  est log P (eq 5)  est log P (eq 6)
V 

1 ,4-dimethy Inaphthalenc 

437 

435 

437 

4.41 

74- 

li2,4-tiichloiobenzene 

4.02 

3.65 

3.84 

3.79 

2 

cyclopropane 

1.72 

136 

0.83 

0.82 

75- 

2;i',6-pcb 

5.48 

4,89 

5.07 

5.09 

3 

3,4-dimcthylchlorobcazcnc 

3,82 

3.65 

3.68 

3.74 

76 

2-butyne 

1.46 

Z22 

Z49 

Z46 

4 

2^-diphcnyl-l,l,l-trichlorocthanc 

4.87 

4,90 

4.93 

4.99 

77 

azulene 

3,20 

339 

3.52 

3.45 

5 

2,6-dimethylnaphthalene 

431 

4.15 

434 

437 

78- 

trifluoiomethylthiobenzene 

337 

336 

Z91 

Z93 

6 

hexafluoroethane 

ZOO 

Z63 

239 

233 

79- 

23-pcb 

5,16 

4,62 

4.90 

4.89 

7** 

Liodohq>tane 

4.70 

4,04 

437 

434 

80- 

1 

1 

2.84 

3.60 

337 

338 

S'* 

allylbiomide 

1,79 

232 

Z04 

2.06 

81 

biphenyl 

4.09 

4.18 

433 

432 

9** 

l^-dunediylmq)hthalene 

438 

433 

438 

4.41 

82- 

p-xylene 

3.15 

3.45 

337 

3.42 

10 

l,8>dimethylna^thalene 

436 

431 

4.41 

4.43 

83- 

ethylene 

1.13 

0.70 

0.93 

0.97 

11- 

i;23‘trichl<m)benzene 

4.05 

3,60 

3.64 

3.63 

84 

thiophenol 

232 

3.08 

3.01 

3.04 

12- 

2>ethyltliiq3henc 

Z87 

330 

Z69 

Z73 

85- 

bromotrifluoromethane 

i.86 

Z20 

Z12 

1.97 

13 

mediylchloiide 

0.91 

0.70 

0.86 

0.79 

86 

9>mediylandiracene 

5.07 

5.07 

4.90 

4.92 

14 

y-phenylpropylfluoridc 

Z95 

3.73 

336 

339 

.87- 

1 

1 

Z42 

Z63 

Z44 

Z44 

15 

iodobcnzcnc 

335 

3,08 

3.68 

3.68 

88- 

1,4-dimethyltetrachloiocyclohexane 

4.40 

4.18 

4.03 

4.11 

16 

l-n^ylpentacfaloiocyclohcxane 

4.04 

4.18 

430 

434 

89 

pxx^pylene 

L77 

138 

139 

1.71 

17- 

ethane 

1.81 

0.70 

130 

1.47 

90 

cyclohexene 

Z86 

235 

Z72 

Z74 

18 

23'-pcb 

5,02 

4,71 

4.99 

4.97 

91- 

mediylthiobenzene 

2.74 

338 

3.02 

Z97 

19 

cyclopentane 

3.00 

2.19 

235 

Z37 

92 

mediylflu(»dc 

31 

0.70 

037 

033 

20 

^ylchloride 

1.43 

138 

1.48 

132 

93 

y-phenylpropyliodide 

3.90 

3.73 

4.06 

4.06 

21 

2-phenylthiophene 

3.74 

3.88 

4.01 

3.96 

94 

23,4^pcb 

5.42 

4.89 

5.10 

5,10 

22 

trichlorofluOTomethane 

Z53 

230 

234 

239 

95- 

fluoropentacfalorocyclohexane 

3.19 

4.18 

3.87 

3.89 

23- 

fluoroform 

0.64 

1.85 

037 

0.42 

96 

lA3,5-tetrachlorobcnzcne 

4.92 

3.91 

4.08 

4.04 

24 

dimcthyldisulfide 

1.77 

232 

137 

139 

97 

Z2'-pcb 

4.90 

4.65 

4.80 

4.82 

25- 

propane 

Z36 

138 

1.97 

2.01 

98 

l>butene 

Z40 

232 

1.96 

2,05 

26 

hexamethylbenzene 

5.11 

4.18 

4.94 

4.97 

99- 

1 3-dimcdiylnaphthalenc 

4.42 

439 

4.43 

4.46 

27 

butanethiol 

Z28 

2.80 

2.81 

2.87 

100- 

1,7-dimcdiylnaphthalcnc 

4.44 

433 

4.43 

4;45 

28- 

diethylsulEde 

1.95 

2.80 

Z68 

2.67 

101- 

1-methylnaphdialcne 

3.87 

3.95 

4.08 

4.07 

29 

cyclohexane 

3,44 

235 

Z83 

Z87 

102 

2,6-pcb 

4.93 

4,70 

4.83 

4.85 

30- 

diphenyldisulfide 

4.41 

4.62 

437 

433 

103- 

a-bromotoluene 

Z92 

338 

3.47 

3.42 

31 

m-fluorobenzylchloridc 

Z77 

335 

Z95 

Z99 

104 

23'3'-trichloiob^hcnyl 

531 

4.89 

532 

530 

32 

l-chloFopropane 

2.04 

232 

1.92 

1.97 

105 

hexafluorobenzenc 

2Z2 

4.18 

330 

2.97 

33 

2,4-dichlorobenzylchloride 

3.82 

4.01 

3.67 

3.69 

106- 

3>bromotfaiophene 

Z62 

Z49 

Z73 

2.72 

34 

m>chlorotoluene 

338 

338 

3.30 

3.34 

107- 

13»33*tetramediylbenzene 

4.17 

3.91 

437 

432 

35- 

butane 

Z89 

232 

239 

Z43 

108 

halothane 

230 

3.01 

Z16 

Z19 

36 

1,23-trimethylbenzcnc 

3.66 

3.60 

3.89 

3.93 

109 

2,4,6-pcb 

5.47 

4.97 

5.02 

5.04 

37 

1,  l-dlifluoroethylene 

134 

1.85 

0.72 

0.79 

110 

IJ-dicfaloroediylene 

Z13 

1.85 

1.89 

1.97 

38- 

l-chlorobutane 

Z64 

Z80 

Z67 

Z70 

111 

o*<libromobenzrae 

3.64 

339 

3.94 

3.88 

39 

2,3^biomothioi^ene 

333 

2.98 

332 

333 

112 

13T43*tetramediylbcnzene 

4.00 

332 

434 

439 

40- 

pentafluoiethylbenzene 

336 

334 

3.15 

3.18 

113 

14iexene 

339 

335 

3.17 

331 

41- 

lA4,5-tetrabiDmobenzene 

5.13 

3.82 

5.06 

4.94 

114- 

neopentane 

3.11 

230 

3.12 

338 

42 

o-dichlorobcnzene 

338 

339 

3.19 

3.19 

115 

chloroform 

1.97 

1.85 

2.11 

2.06 

43- 

1^3,4-tetrachlorobenzene 

4.64 

3.81 

4.01 

3.97 

116- 

1-fluorobutane 

238 

Z80 

Z15 

230 

44- 

tcibiomoethene 

330 

Z63 

337 

336 

117- 

pyrene 

438 

5.46 

4.90 

4.88 

45 

pentane 

339 

Z80 

3.01 

3.03 

118 

14-<lidiloro-2^-dq>henylethane 

431 

4,95 

4.88 

4.94 

46- 

isobutane 

Z76 

1.85 

Z61 

Z71 

119- 

isobutylene 

234 

1.85 

Z47 

Z61 

47- 

mirex 

538 

5.10 

536 

5.18 

120 

d4>henylinediane 

4.14 

4.40 

431 

434 

48- 

13-dichlOTobenzenc 

3.60 

338 

334 

333 

121 

isofxopylbenzene 

3.66 

335 

3.62 

3.67 

49- 

1,2-dimethyhii^hthalenc 

431 

435 

4.41 

4.43 

122- 

naphthalene 

330 

3.43 

338 

334 

50- 

2>cthylnq)hthalene 

438 

432 

433 

434 

123- 

l-hq)tene 

3.99 

3.86 

3.48 

330 

51- 

cyclohcptatxieoe 

Z63 

339 

Z74 

Z74 

124 

23-dhnethylbutane 

3.82 

Z88 

‘  3.45 

335 

52- 

3-chloiobiphenyl 

438 

4.42 

4.65 

4.64 

125 

l-CLuoropentane 

233 

335 

Z79 

2.82 

53- 

3-ediylthi<^ene  ' 

Z82 

330 

Z72 

Z75 

126- 

4>>xylene 

3.12 

339 

3.44 

3.49 

54 

13f5-tribiomobenzene 

431 

4.00 

434 

4.48 

127- 

ediylbenzene 

3.15 

338 

335 

336 

55- 

/?-phenylethylchloride 

Z95 

330 

331 

333 

128- 

tdchloromethyldiiobenzene 

3.78 

336 

339 

3.61 

56 

acenaph&ene 

3.92 

4.49 

3.94 

3.95 

129- 

ftiophene 

1,81 

Z19 

1.64 

1.62 

57 

m-dibiomobenzene 

3,75 

338 

4.06 

3.98 

130 

bromodiloromediane 

1.41 

138 

1.49 

1.47 

58 

dlchlorodifluoromethane 

Z16 

230 

1.88 

1.83 

131- 

l^-^lichlototetrafluoro^hane 

Z82 

Z63 

Z71 

Z65 

59 

toluene 

Z73 

3.08 

3.04 

3,05 

132- 

2-dilorob4)benyl 

438 

4,43 

4.65 

4.65 

60- 

anthracene 

4.45 

4.85 

4.62 

439 

133 

2,4' -diidilorobiphenyl 

5.10 

4.68 

4.88 

4.87 

61- 

hexachloiocyclopentadiene 

5.04 

4.00 

4.99 

4.86 

134- 

1334iid3lorobeazene 

4,15 

4.00 

3.48 

330 

62 

3-phenyl-l-chloropropanc 

335 

3.73 

336 

338 

135 

l>octene 

437 

4.04 

3.77 

3.78 

63- 

bibenzyl 

4.79 

4.62 

4.69 

4.71 

136 

mcthylbromidc 

1.19 

0.70 

133 

1.07 

64- 

l*chloFoheptane 

4.15 

4.04 

3.72 

3.71 

137- 

phenylethylsulfide 

3Z0 

330 

3.37 

3.36 

65- 

2,4-dichlOTOtolucnc 

434 

3,65 

3.60 

3.64 

138 

l-cthyl-2-mcthylbenzcne 

333 

3.54 

3.81 

3.84 

66- 

1,  l-dichloroethane 

1.79 

1.85 

1,93 

Z02 

139- 

propylbenzene 

3,72 

330 

3.56 

3.58 

67- 

O^benzothiophene 

3.12 

3.15 

334 

3.17 

140- 

indane 

3.18 

3.15 

3.06 

3.04 

68- 

2-bromothi<^henc 

Z75 

2.49 

2.62 

Z61 

141 

2-chloropropane 

1.90 

1.85 

232 

233 

69 

chlorodifluoromethane 

1.08 

1,85 

0,75 

0.75 

142- 

phenylazide 

239 

330 

2.83 

2.88 

70- 

pentachlorobenzene 

5.17 

4.02 

4.62 

4.52 

143 

2,4-dibromotetrachlorocyclohcxane 

3,98 

4.18 

4.25 

4.29 

71 

9,  lO-dihydroanthracene 

4.25 

4.85 

431 

4.34 

144- 

tetrachlorocdiylcne 

3.40 

3.01 

3,69 

3.52 

72 

1 .3-(bis-chloromethyl)bcnzene 

2.72 

3.85 

3,49 

3.51 

145 

l-noncne 

5.15 

437 

3.97 

3.98 

73 

chlorobenzene 

2,84 

3.08 

2-90 

2.90 

146 

23-dimethylbutane 

3.85 

3,01 

3,41 

3.50 

Study  of  Topological  and  Geometrical  Parameters 
Table 1 (Continued)

no.  chemical name  obs log P  est log P (eq 4)  est log P (eq 5)  est log P (eq 6)

147- 

dichlorofluoromethane 

155 

1.85 

125 

1.30 

184- 

23,6-trimcthyhuphthalcac 

4.73 

4.46 

4.61 

4.64 

148- 

1, 1,2,2-tetrachloroethane 

239 

3.01 

Z91 

2.90 

185- 

difluorometh^e 

30 

138 

030 

0.11 

149- 

14^4-trimethy)bctj2ciic 

3.78 

3.65 

3.95 

3,98 

186 

1 ,2,4-trifluorobenzcne 

252 

3.65 

Z63 

255 

150- 

fluorobcnzenc 

121 

3.08 

239 

2.40 

187 

bromobenzenc 

2.99 

3.08 

338 

333 

151 

butylbenzene 

426 

3.73 

3.81 

3.83 

188 

hexachloro-  13-butadicne 

4.78 

436 

5.00 

4.86 

152- 

ethylbromide 

1.61 

138 

1.98 

1,95 

189 

vinylbromide 

157 

138 

1.76 

1.78 

153- 

tetrafluoiomethane 

1.18 

220 

1.61 

129 

190- 

o-chlcHX>toluene 

3.42 

339 

333 

336 

154- 

p-cymene 

4.10 

3.93 

3.88 

3.92 

191- 

a-chloiotoluene 

330 

338 

3.09 

3.10 

155- . 

p<chloiotoluene 

333 

3.45 

3.17 

322 

192 

1,4-cyclohcxadicnc 

230 

255 

Z42 

Z46 

156- 

l-bromopropane 

2.10 

222 

235 

234 

193 

l-bromoh^tane 

436 

4.04 

4.06 

4.00 

157- 

bromocyclohexane 

320 

3.08 

3.47 

3.46 

194 

styrene 

^95 

338 

3.15 

3.17 

158- 

2-methylthiophene 

233 

2.49 

239 

2.41 

195 

chloiotdfluoiomethane 

r.65 

230 

156 

1.45 

159 

diphcnylsulfide 

4.45 

4.40 

4.48 

4,47 

196- 

(dimethyl)phenylphosphine 

257 

335 

Z99 

Z92 

160- 

lA4,5-tctrachlorobenzene  4.82 

3.82 

3.93 

3.91 

197 

cycloocta-l,5-dicne 

3.16 

333 

2,88 

Z93 

161 

1,1,1-trichloroethane 

2.49 

220 

2.43 

252 

198 

tetrachlorocyclohexane 

Z82 

3.82 

350 

355 

162- 

p-dicUoiobenzene 

352 

3.45 

3.08 

3.10 

199 

1-biomooctane 

4.89 

437 

433 

4,17 

163 

l-bromobutanc 

2.75 

2.80 

3.15 

3.14 

200 

2-mediyln^hdialene 

3.86 

3*90 

4.03 

4.01 

164- 

p-chlorobiphenyl 

4.61 

450 

457 

456 

201 

3-metiiyldiiophcae 

2.34 

Z49 

2.44 

2.46 

165- 

cyclopropylbcnzene 

327 

2.98 

3.01 

3.02 

202- 

methylenechloride 

135 

138 

138 

135 

166- 

2,6-dichlon)toluene 

429 

3.60 

3.48 

354 

203- 

hexa^oibbenzene 

5.31 

4.18 

5.09 

4.93 

167- 

allene 

1.45 

138 

1.42 

1.48 

204 

indene 

Z92 

3.15 

3.01 

3.01 

168- 

b-phenylcthylbromide 

3.09 

350 

3,66 

3.64 

205 

fert-butylbenzcne 

4.11 

336 

3.92 

3.99 

169- 

13'butadiene 

1.99 

222 

1.88 

1.97 

206 

1 ,2-<iichloroethane 

1.48 

232 

1.92 

1.91 

170 

2-chlorothiophcne 

254 

2,49 

2.14 

2,16 

207- 

135-trimethylbenzcne 

3.42 

4,00 

3.81 

3.87 

171 

l-bromopentane 

3.37 

325 

3.62 

358 

208- 

phenandirene 

4.46 

4.88 

4.69 

4.68 

172- 

y-phenylpropylbromide 

3.12 

3.73 

3.87 

3.84 

209- 

benzene 

2.13 

2.55 

2.40 

2.39 

173 

1 ,3-cyclohcxadienc 

2.47 

255 

2.47 

250 

210 

333*trifluoropropylbenzcne 

3.31 

3.80 

3.19 

334 

174- 

pcntamcthylbcnzcnc 

456 

4.02 

4.65 

4.69 

211- 

a-(2,2,2-trichloroethyl)styrene 

456 

3.93 

4.04 

4.13 

175- 

p-dibromobcnzcnc 

3.79 

3.45 

3.81 

3.76 

212- 

25-<liniethylnaphthalene 

4.40 

4.20 

435 

438 

176 

1,4-pentadicnc 

2.48 

2.80 

2.32 

2.42 

213- 

13^chlOT0propane 

2.00 

Z80 

2.47 

Z47 

177- 

me&yliodide 

151 

0.70 

1.48 

1.42 

214 

l,23>4-tetrame&ylbenzene 

4.11 

3.81 

436 

431 

178- 

1,1-difluoroethane 

.75 

1.85 

1.04 

1.11 

215- 

stilbene-t 

4.81 

4.62 

4.79 

4.78 

179- 

l-bromohexanc 

3.80 

3.86 

3.80 

3.75 

216 

fluorene 

4.18 

4.65 

432 

431 

180- 

TO-xylcnc 

320 

358 

3.44 

3.49 

217- 

2-fluorO‘3-biomotctiacblorocycldiexane 

338 

4.18 

4.06 

4.09 

181 

dibaizothiophene 

438 

4.65 

4.44 

4.40 

218- 

allylbenzene 

333 

350 

337 

3.41 

182 

ethyliodide 

2.00 

138 

228 

234 

219- 

carbontetrachloride 

Z83 

230 

337 

3.10 

183 

trifluoromethylbcnzcac 

3.01 

326 

2.n 

Z80 

*  Training  chemicals. 


The largest difference in variance explained was for the topological parameter model. For this model, $R^2$ decreased from 80.8% to 79.5%, or 1.3% less variance explained. However, the standard error for the test chemicals was 0.1 °C lower. For the other two models, the $R^2$ of the test chemicals was within 0.6% of that seen for the training chemicals. Standard errors for the test chemicals were within 1 °C of the standard error for the training set of chemicals.

Regression analysis of the set of training and test chemicals combined showed results similar to the analysis of the training set of chemicals. Using only the topological class of indices, stepwise regression resulted in an eight parameter model to estimate boiling point:

$$
\mathrm{BP} = -21.9 + 30.6(W) - 21.5(O) + 69.9({}^{?}\chi) + 35.8({}^{?}\chi) - 106.5({}^{?}\chi_{C}) - 96.1({}^{?}\chi_{?}) - 17.7(\,?\,) + 19.5(P_{10}) \tag{1}
$$

n = 1023, $R^2$ = 81.2%, s = 39.7 °C, F = 547
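As a quick consistency check on the reported statistics, the overall F ratio can be recomputed from $R^2$, the number of chemicals, and the number of parameters. The sketch assumes the standard formula for a least-squares model with an intercept; the paper does not state which formula its software used, and the function name is hypothetical.

```python
def f_statistic(r_squared, n_obs, n_params):
    """Overall F ratio: (R^2 / p) / ((1 - R^2) / (n - p - 1))."""
    explained = r_squared / n_params
    unexplained = (1.0 - r_squared) / (n_obs - n_params - 1)
    return explained / unexplained


# Eight-parameter topological model: R^2 = 81.2%, n = 1023.
f_eq1 = f_statistic(0.812, 1023, 8)
print(round(f_eq1))  # 547
```

With the rounded $R^2$ of 81.2% the formula gives a value that rounds to the reported F of 547, which supports the reading of the model as having eight parameters.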

These eight parameters were added to the set of topochemical parameters. Again, stepwise regression was used to develop a model using the eight topological and all topochemical indices. The best model to estimate boiling point consisted of eight parameters again:

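$$
\mathrm{BP} = -332.9 + 134.6({}^{?}\chi) + 10.9(P_{10}) + 110.0(IC_{0}) - 133.8({}^{?}\chi^{?}) - 80.2({}^{?}\chi^{?}_{C}) + 176.5({}^{0}\chi^{?}) + 44.8({}^{?}\chi^{?}) + 16.8({}^{?}\chi^{?}_{PC}) \tag{2}
$$

n = 1023, $R^2$ = 96.1%, s = 18.0 °C, F = 3151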

Only two of the topological indices used in eq 1 were retained by the regression procedure in eq 2: the path connectivity index ${}^{?}\chi$ and $P_{10}$. The improvement in $R^2$ was very significant, going from 81.2% for eq 1 to 96.1% for eq 2. Also, the model error decreased by over half, dropping from 39.7 °C to 18.0 °C. Using all subsets regression on the eight parameters of eq 2 and the three geometric parameters resulted in a ten parameter model as follows:

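$$
\mathrm{BP} = -285.7 + 125.3({}^{?}\chi) + 10.6(P_{10}) + 74.5(IC_{0}) - 125.0({}^{?}\chi^{?}) - 86.3({}^{?}\chi^{?}_{C}) + 175.3({}^{0}\chi^{?}) + 49.1({}^{?}\chi^{?}) + 18.7({}^{?}\chi^{?}_{PC}) - 9.1({}^{3D}W_{H}) + 8.1({}^{3D}W) \tag{3}
$$

n = 1023, $R^2$ = 96.3%, s = 17.6 °C, F = 2650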

Equation  3  contains  all  of  the  parameters  from  eq  2  plus 
the  two  variants  of  the  3-D  Wiener  number.  The  addition 


Basak et al.

1058 J. Chem. Inf. Comput. Sci., Vol. 36, No. 6, 1996


Table 2. Symbols, Definitions, and Classifications of Topological and Geometrical Parameters

Topological

$I_D^W$  information index for the magnitudes of distances between all possible pairs of vertices of a graph
$\overline{I}_D^W$  mean information index for the magnitude of distance
$W$  Wiener index = half-sum of the off-diagonal elements of the distance matrix of a graph
$I^D$  degree complexity
$H^V$  graph vertex complexity
$H^D$  graph distance complexity
$\overline{IC}$  information content of the distance matrix partitioned by frequency of occurrences of distance h
$O$  order of neighborhood when $IC_r$ reaches its maximum value for the hydrogen-filled graph
$M_1$  a Zagreb group parameter = sum of square of degree over all vertices
$M_2$  a Zagreb group parameter = sum of cross-product of degrees over all neighboring (connected) vertices
${}^{h}\chi$  path connectivity index of order h = 0-6
${}^{h}\chi_{C}$  cluster connectivity index of order h = 3-5
${}^{h}\chi_{PC}$  path-cluster connectivity index of order h = 4-6
${}^{h}\chi_{Ch}$  chain connectivity index of order h = 5-6
$P_h$  number of paths of length h = 0-10
$J$  Balaban's J index based on distance

Topochemical

$I_{ORB}$  information content or complexity of the hydrogen-suppressed graph at its maximum neighborhood of vertices
$IC_r$  mean information content or complexity of a graph based on the rth (r = 0-6) order neighborhood of vertices in a hydrogen-filled graph
$SIC_r$  structural information content for rth (r = 0-6) order neighborhood of vertices in a hydrogen-filled graph
$CIC_r$  complementary information content for rth (r = 0-6) order neighborhood of vertices in a hydrogen-filled graph
${}^{h}\chi^{b}$  bond path connectivity index of order h = 0-6
${}^{h}\chi^{b}_{C}$  bond cluster connectivity index of order h = 3-5
${}^{h}\chi^{b}_{Ch}$  bond chain connectivity index of order h = 5-6
${}^{h}\chi^{b}_{PC}$  bond path-cluster connectivity index of order h = 4-6
${}^{h}\chi^{v}$  valence path connectivity index of order h = 0-6
${}^{h}\chi^{v}_{C}$  valence cluster connectivity index of order h = 3-5
${}^{h}\chi^{v}_{Ch}$  valence chain connectivity index of order h = 5-6
${}^{h}\chi^{v}_{PC}$  valence path-cluster connectivity index of order h = 4-6
$J^B$  Balaban's J index based on bond types
$J^X$  Balaban's J index based on relative electronegativities
$J^Y$  Balaban's J index based on relative covalent radii

Geometric

$V_W$  van der Waals volume
${}^{3D}W$  3-D Wiener number for the hydrogen-suppressed geometric distance matrix
${}^{3D}W_H$  3-D Wiener number for the hydrogen-filled geometric distance matrix
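Three of the topological definitions in Table 2 (the Wiener index and the two Zagreb group parameters) are simple enough to illustrate directly. The sketch below works on a hydrogen-suppressed graph given as an adjacency list and assumes the graph is connected; it is not the POLLY implementation, and the function names are hypothetical.

```python
from collections import deque


def distance_matrix(adj):
    """Topological (bond-count) distances via breadth-first search."""
    n = len(adj)
    dist = []
    for start in range(n):
        row = [-1] * n  # -1 marks a vertex not yet reached
        row[start] = 0
        queue = deque([start])
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if row[v] < 0:
                    row[v] = row[u] + 1
                    queue.append(v)
        dist.append(row)
    return dist


def wiener_index(adj):
    """Half-sum of the off-diagonal entries of the distance matrix."""
    dist = distance_matrix(adj)
    n = len(adj)
    return sum(dist[i][j] for i in range(n) for j in range(n) if i != j) // 2


def zagreb_m1(adj):
    """Sum of squared vertex degrees."""
    return sum(len(neighbors) ** 2 for neighbors in adj)


def zagreb_m2(adj):
    """Sum of degree products over all edges (each edge counted once)."""
    return sum(len(adj[u]) * len(adj[v])
               for u in range(len(adj)) for v in adj[u] if u < v)


n_butane = [[1], [0, 2], [1, 3], [2]]    # C-C-C-C chain
isobutane = [[1], [0, 2, 3], [1], [1]]   # central carbon with three neighbors
```

The two four-carbon isomers give different values for all three indices, which is the kind of branching sensitivity these descriptors are meant to capture.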


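Table 3. Summary of Regression Results for the Training Set of Chemicals and Predictions of Test Set of Chemicals for Dependent Variable BP (°C) for Three Parameter Classes

                                          training set (N = 512)      test set (N = 511)
parameter class                           F       R^2 (%)   s         R^2 (%)   s
topological                               191     80.8      40.9      79.5      40.8
topological + topochemical                1547    96.5      17.4      96.0      18.0
topological + topochemical + geometric    1486    96.7      16.8      96.1      17.7

Variables included (? marks an index that is illegible in the source):
topological: $\overline{IC}$, $O$, $M_1$, ${}^{?}\chi$, ${}^{?}\chi_{C}$, $P_{10}$, $J$; the remaining entries of the 11-variable list are illegible
topological + topochemical: ${}^{?}\chi$, $P_{10}$, $IC_0$, ${}^{?}\chi^{?}$, ${}^{?}\chi^{?}_{C}$, ${}^{?}\chi^{?}$, ${}^{?}\chi^{?}$, ${}^{?}\chi^{?}_{C}$, ${}^{?}\chi^{?}_{?}$
topological + topochemical + geometric: ${}^{?}\chi$, $P_{10}$, $IC_0$, ${}^{?}\chi^{?}$, ${}^{?}\chi^{?}_{C}$, ${}^{?}\chi^{?}$, ${}^{?}\chi^{?}$, ${}^{?}\chi^{?}_{C}$, ${}^{?}\chi^{?}_{PC}$, ${}^{3D}W_H$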

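Table 4. Summary of Regression Results for the Training Set of Chemicals and Predictions of Test Set of Chemicals for Dependent Variable Log P for Three Parameter Classes

                                          training set (N = 114)      test set (N = 105)
parameter class                           F       R^2 (%)   s         R^2 (%)   s
topological                               40.7    77.9      0.57      73.8      0.60
topological + topochemical                122.6   89.0      0.40      85.6      0.45
topological + topochemical + geometric    123.0   89.0      0.39      85.3      0.45

Variables included (? marks an index that is illegible in the source):
topological: $I_D^W$, ${}^{?}\chi$, ${}^{?}\chi_{C}$, ${}^{?}\chi_{?}$, ${}^{?}\chi_{?}$, $P_7$, $P_{?}$; the remaining entries of the nine-variable list are illegible
topological + topochemical: ${}^{?}\chi_{?}$, ${}^{?}\chi^{?}_{C}$, $SIC_{?}$, ${}^{?}\chi^{?}$, ${}^{?}\chi^{?}_{PC}$, ${}^{?}\chi^{?}$, $J^{?}$
topological + topochemical + geometric: ${}^{?}\chi^{?}_{C}$, $SIC_{?}$, ${}^{?}\chi^{?}_{PC}$, ${}^{?}\chi^{?}$, ${}^{3D}W$; the remaining entries of the seven-variable list are illegible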

of the two 3-D Wiener numbers resulted in only a very slight increase in the predictive power of the model. The standard error (s) decreased by only 0.4 °C with the addition of the geometric parameters, and $R^2$ increased from 96.1% to 96.3%, an increase of only 0.2% in the variance explained in going from eq 2 to eq 3. A scatterplot of observed boiling point vs estimated boiling point using eq 3 is shown in Figure 2.

3.2. Log P Estimation. Stepwise regression analyses for log P of the training set of chemicals are summarized in Table 4. The topological parameter model included nine variables. These nine variables explained 77.9% of the variance with a standard error of 0.57. Regression analysis of these nine topological parameters and the topochemical parameters resulted in a better model with only seven parameters. This model included two topological parameters and five topochemical parameters. The $R^2$ increased from 77.9% to 89.0% and s decreased from 0.57 to 0.40. Adding the geometric parameters provided a very minor increase. For this model, ${}^{3D}W$ replaced ${}^{?}\chi^{?}$, the $R^2$ remained the same, and s decreased from 0.40 to 0.39.

Application  of  these  models  to  the  test  set  of  chemicals 
resulted  in  slightly  decreased  variance  explained  and  slightly 




Figure 2. Scatterplot of observed boiling point vs estimated boiling point using eq 3 for 1023 diverse chemicals.

increased standard error. All $R^2$ for the test set differed by no more than 4.1% from the $R^2$ seen for the training set. The standard error of the test set of chemicals was within 0.06 of the standard error of the training set. These results can be seen in Table 4.

As with the BP data set, regression analyses of the combined training and test sets were similar to the analyses of the training sets. Starting with topological parameters only, the following seven parameter model was developed to estimate log P:

$$
\log P = -1.42 + 1.08(W) - 1.58({}^{?}\chi) + \ldots - 0.92({}^{?}\chi_{C}) - 0.32(P_{7}) + 0.20(P_{10}) + 1.97(J) \tag{4}
$$

n = 219, $R^2$ = 78.9%, s = 0.54, F = 112

The seven parameters of eq 4 were added to the set of topochemical indices, and a new model was developed using stepwise regression. This new model consisted of ten parameters:

$$
\log P = -2.13 - 0.20({}^{?}\chi) + 0.18(P_{10}) - 1.86(IC_{0}) + 1.33(CIC_{2}) - 0.92(CIC_{3}) - 1.36({}^{?}\chi^{?}) + 5.76({}^{0}\chi^{?}) - 2.98({}^{?}\chi^{?}) + 0.54({}^{?}\chi^{?}) - 0.39({}^{?}\chi^{?}_{C}) \tag{5}
$$

n = 219, $R^2$ = 90.8%, s = 0.36, F = 206

As with the boiling point models, only two of the topological parameters were retained in eq 5, ${}^{?}\chi$ and $P_{10}$. Also, just like the boiling point models, the addition of the topochemical parameters resulted in a significant increase in the quality of log P estimation.

Figure 3. Scatterplot of observed log P vs estimated log P using eq 6 for 219 diverse chemicals.

All subsets regression using the ten parameters of eq 5 and the geometric parameters resulted in the following 11 parameter model:

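$$
\log P = -5.60 + 0.19(P_{10}) - 1.46(IC_{0}) + 1.09(CIC_{2}) - 0.77(CIC_{3}) - 1.36({}^{?}\chi^{?}) + 5.34({}^{0}\chi^{?}) - 3.41({}^{?}\chi^{?}) + 0.55({}^{?}\chi^{?}) - 0.41({}^{?}\chi^{?}_{C}) + 1.10(V_{W}) - 0.17({}^{3D}W) \tag{6}
$$

n = 219, $R^2$ = 91.2%, s = 0.35, F = 194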

Equation 6 differs from eq 5 with the removal of ${}^{?}\chi$ and the addition of $V_W$ and ${}^{3D}W$. The addition of the geometric parameters resulted in only slight improvement in the ability to estimate log P.

Estimated log P values using eqs 4-6 may be found in Table 1. Figure 3 shows a scatterplot of observed log P vs estimated log P using eq 6.

4.  DISCUSSION 

The objective of this paper was to carry out a comparative study of the effectiveness of topological, topochemical, and geometrical parameters in SAR. To this end, we used these three classes of parameters in predicting normal boiling point of a diverse set of 1023 chemicals and log P of a set of 219 chemicals. To further assess the utility of these models for predictive purposes, the data sets were split into training and test sets by randomly assigning chemicals to one or the other. Models developed using the training sets of chemicals were used to predict the relevant property of the test chemicals.

As can be seen in Tables 3 and 4, the models
developed  using  the  training  sets  of  chemicals  could  predict 
BP  and  log  P  of  the  test  chemicals  as  accurately  as  they 
could  estimate  these  properties  for  the  training  chemicals. 
Therefore,  it  seemed  reasonable  to  combine  the  training  and 
test  sets  to  develop  the  regression  models. 
Both for boiling point and log P, topological variables gave
a  reasonable  predictive  model.  The  addition  of  topochemical 
parameters  to  the  set  of  independent  variables  resulted  in 
substantial  improvement  in  model  performance.  Further 
addition  of  geometrical  variables  gave  slight  improvement 
in  explained  variance  in  these  data. 

Our modeling approach in this paper was a hierarchical one, beginning with parameters derived from the simplest (topological) representation of molecules. Such indices are derived from simple graphs which are unweighted and, consequently, do not represent the reality of chemicals very well. The next tier of variables, topochemical indices, quantify information about topology as well as atom types and bonding pattern. Finally, geometrical or 3-D parameters were used for modeling. The results show that the addition of chemical information makes a substantial contribution to the predictive power of the models for both boiling point and log P. It would be interesting to see whether this trend is valid for other properties.
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Numerous  quantitative  structure— activity  relationships  (QS  ARs)  have  been  developed  using  topostructu^, 
topochemical,  and  geometrical  molecular  descriptors.  However,  few  systematic  studies  have  been  earned 
out  on  the  relative  effectiveness  of  these  tinee  classes  of  parameters  in  predicting  propicrtics.  We  have 
carried  out  a  systematic  analysis  of  the  relative  utihty  of  the  three  types  of  structural  descriptors  in  developing 
QSAR  models  for  predicting  vapor  pressure  at  STP  for  a  set  of  476  diverse  chemicals.  The  hierar^cal 
technique  has  proven  to  be  useful  in  jlhiniifiafing  tiic  relationships  of  different  types  of  molecular  description 
infonnation  to  physicochemical  property  and  is  a  use^  tool  for  limiting  the  number  of  independent  variables 
in  linear  regression  modeling  to  avoid  the  problems  of  chance  correlations. 


1.  INTRODUCTION 

A large number of quantitative structure-activity relationship
(QSAR) studies have been reported in recent literature
using theoretical molecular descriptors in predicting
physicochemical, pharmacological, and toxicological properties
of molecules. Such descriptors comprise graph invariants,
geometrical or 3-D parameters, and quantum chemical
indices. One of the reasons for the current upsurge of interest
is the fact that such descriptors can be derived algorithmically,
i.e., can be computed for any molecule, real or
hypothetical, using standard software. Both in pharmaceutical
drug design and in risk assessment of chemicals, one
has to evaluate potential biological effects of chemicals.
Evaluation schemes based on property-property correlation
paradigms are not very useful in practical situations because,
for most of the candidate structures, the experimental data
necessary for proper evaluation are not available. This is
especially true for the thousands of chemicals rapidly
produced by methods of combinatorial chemistry as well
as for the large number of chemicals present in the Toxic
Substances Control Act (TSCA) Inventory.

A large number of physicochemical and biological endpoints
are necessary for estimating the ecotoxicological fate,
transport, and effects of environmental pollutants. The
vapor pressure of chemicals is important in determining the
partitioning of chemicals among different phases once they
are released in the environment. Many QSARs have been
reported for predicting normal vapor pressure of chemicals.
Such studies are usually carried out on small sets of
congeneric chemicals. Also, many QSARs use experimental
data as inputs in the model. Therefore, it becomes necessary
to develop QSARs based on nonempirical parameters which
can predict the vapor pressure for a heterogeneous collection
of chemicals so that such models are generally applicable.
With this end in mind, in the current paper we have carried
out a QSAR study of 476 diverse chemicals using three types
of nonempirical molecular descriptors.

2. MATERIALS AND METHODS

2.1. Normal Vapor Pressure Database. Measured
values for a subset of the Toxic Substances Control Act
(TSCA) Inventory were obtained from the ASTER (Assessment
Tools for the Evaluation of Risk) database. This
subset consisted of a diverse set of chemicals where vapor
pressure (pvap) was measured at 25 °C and over a pressure
range of approximately 3-10 000 mmHg. Due to the size
of the dataset being used in this study, data for these
chemicals will not be listed in this paper. An electronic copy
of the data may be obtained by contacting the authors.

2.2. Computation of Topological Indices. The majority
of the topological indices (TIs) used in this study have been
calculated by the computer program POLLY 2.3. These
indices include the Wiener index, the molecular connectivity
indices developed by Randić and Kier and Hall, information
theoretic indices defined on distance matrices of
graphs, and a set of parameters derived on the neighborhood
complexity of vertices in hydrogen-filled molecular
graphs. Balaban's J indices were calculated using
software developed by the authors.

Van der Waals volume (V_W) was calculated using
Sybyl 6.2. The 3-D Wiener numbers were calculated by
Sybyl using an SPL (Sybyl Programming Language) program
developed by the authors. Calculation of 3-D Wiener
numbers consists of the summation of the entries in the upper
triangular submatrix of the topographic Euclidean distance
matrix for a molecule. The 3-D coordinates for the atoms
were determined using CONCORD 3.2.1. Two variants
of the 3-D Wiener number were calculated, 3DW_H and 3DW,
where hydrogen atoms are included and excluded from the
computations, respectively.
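
The summation described above can be sketched in a few lines of Python. This is only an illustration, not the authors' SPL program: the function name `wiener_3d`, the `(symbol, (x, y, z))` atom format, and the coordinates are all hypothetical.

```python
import math
from itertools import combinations

def wiener_3d(atoms, include_hydrogens=True):
    """Sum of Euclidean distances over all unordered atom pairs, i.e. the
    upper triangle of the geometric distance matrix.

    `atoms` is a list of (element_symbol, (x, y, z)) tuples; this input
    format is an assumption made for the sketch.
    """
    coords = [xyz for symbol, xyz in atoms
              if include_hydrogens or symbol != "H"]
    return sum(math.dist(a, b) for a, b in combinations(coords, 2))

# Made-up collinear coordinates (in angstroms), for illustration only.
toy_molecule = [
    ("C", (0.0, 0.0, 0.0)),
    ("C", (1.5, 0.0, 0.0)),
    ("H", (3.0, 0.0, 0.0)),
]

print(wiener_3d(toy_molecule))                           # hydrogen-filled variant
print(wiener_3d(toy_molecule, include_hydrogens=False))  # hydrogen-suppressed variant
```

Each unordered pair is counted once, which is what summing the upper triangular submatrix amounts to.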

Table 1 provides a complete listing of all of the topological
and geometrical parameters which have been used in this
study. The listing includes the symbols used to represent
the parameters and brief definitions for each of the parameters.

Two additional parameters were used in modeling normal
vapor pressure: HB1 and dipole moment (μ). HB1 is a
simple hydrogen bonding parameter calculated using a
program developed by Basak, which is based on the ideas
of Ou et al. Dipole moment was calculated using Sybyl 6.2.

Table 1. Symbols and Definitions of Topological and Geometrical
Parameters

symbol      definition
I_D^W       information index for the magnitudes of distances between all possible pairs of vertices of a graph
Ī_D^W       mean information index for the magnitude of distance
W           Wiener index = half-sum of the off-diagonal elements of the distance matrix of a graph
I^D         degree complexity
H^V         graph vertex complexity
H^D         graph distance complexity
ĪC          information content of the distance matrix partitioned by frequency of occurrences of distance h
I_ORB       information content or complexity of the hydrogen-suppressed graph at its maximum neighborhood of vertices
O           order of neighborhood when IC_r reaches its maximum value for the hydrogen-filled graph
M_1         a Zagreb group parameter = sum of square of degree over all vertices
M_2         a Zagreb group parameter = sum of cross-product of degrees over all neighboring (connected) vertices
IC_r        mean information content or complexity of a graph based on the rth (r = 0-5) order neighborhood of vertices in a hydrogen-filled graph
SIC_r       structural information content for rth (r = 0-5) order neighborhood of vertices in a hydrogen-filled graph
CIC_r       complementary information content for rth (r = 0-5) order neighborhood of vertices in a hydrogen-filled graph
^hχ         path connectivity index of order h = 0-6
^hχ_C       cluster connectivity index of order h = 3-6
^hχ_PC      path-cluster connectivity index of order h = 4-6
^hχ_Ch      chain connectivity index of order h = 5, 6
^hχ^b       bond path connectivity index of order h = 0-6
^hχ^b_C     bond cluster connectivity index of order h = 3-6
^hχ^b_Ch    bond chain connectivity index of order h = 5, 6
^hχ^b_PC    bond path-cluster connectivity index of order h = 4-6
^hχ^v       valence path connectivity index of order h = 0-6
^hχ^v_C     valence cluster connectivity index of order h = 3-6
^hχ^v_Ch    valence chain connectivity index of order h = 5, 6
^hχ^v_PC    valence path-cluster connectivity index of order h = 4-6
P_h         number of paths of length h = 0-10
J           Balaban's J index based on distance
J^B         Balaban's J index based on bond types
J^X         Balaban's J index based on relative electronegativities
J^Y         Balaban's J index based on relative covalent radii
V_W         van der Waals volume
3DW         3-D Wiener number for the hydrogen-suppressed geometric distance matrix
3DW_H       3-D Wiener number for the hydrogen-filled geometric distance matrix
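
Two of the simpler Table 1 definitions, the Wiener index and the Zagreb group parameters, can be made concrete with a short Python sketch. It assumes a hydrogen-suppressed molecular graph given as an adjacency dictionary; the function names and the two example graphs are illustrative and are not taken from POLLY.

```python
from collections import deque

def distances_from(adj, start):
    """Breadth-first search: topological distance from `start` to every vertex."""
    dist = {start: 0}
    queue = deque([start])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist

def wiener_index(adj):
    """Half-sum of the off-diagonal elements of the distance matrix."""
    total = sum(d for u in adj for d in distances_from(adj, u).values())
    return total // 2

def zagreb_indices(adj):
    """M1 = sum of squared vertex degrees; M2 = sum over edges of degree products."""
    m1 = sum(len(nbrs) ** 2 for nbrs in adj.values())
    m2 = sum(len(adj[u]) * len(adj[v]) for u in adj for v in adj[u] if u < v)
    return m1, m2

# Hydrogen-suppressed carbon skeletons (vertex labels are arbitrary).
n_butane = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
isobutane = {0: [1, 2, 3], 1: [0], 2: [0], 3: [0]}

print(wiener_index(n_butane), zagreb_indices(n_butane))    # 10 (10, 8)
print(wiener_index(isobutane), zagreb_indices(isobutane))  # 9 (12, 9)
```

The two isomers have the same atoms but different indices, which is the sense in which such descriptors encode branching.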

2.3. Data Reduction. The set of 92 TIs was partitioned
into two distinct subsets: topostructural indices and
topochemical indices. The distinction was made as follows:
topostructural indices encode information about the adjacency
and distances of atoms (vertices) in molecular structures
(graphs) irrespective of the chemical nature of the atoms
involved in the bonding or factors like hybridization states
of atoms and number of core/valence electrons in individual
atoms, while topochemical indices quantify information
regarding the topology (connectivity of atoms) as well as
specific chemical properties of the atoms comprising a
molecule. Topochemical indices are derived from weighted
molecular graphs where each vertex (atom) is properly
weighted with selected chemical/physical properties. These
subsets are shown in Table 2.

The partitioning of the indices left 38 topostructural indices
and 54 topochemical indices. At this point no further data
reduction is called for, since the ratio of the number of
observations in the training set (342) to the total number of
variables (92 maximum) falls well within the condition limits
suggested by Topliss and Edwards for reducing the
probability of spurious correlations even at the more
conservative R^2 > 0.7 level.

Table 2. Classification of Parameters Used in Modeling Normal
Vapor Pressure [log10(pvap)]

2.4.  Statistical  Analysis  and  Hierarchical  QSAR. 
Initially,  all  TIs  were  transformed  by  the  natural  logarithm 
of  the  index  plus  one.  This  was  done  since  the  scale  of  some 
indices  may  be  several  orders  of  magnitude  greater  than  that 
of  other  indices.  The  geometric  parameters  were  transformed 
by  the  natural  logarithm  of  the  parameter. 
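
The two transformations can be written out as a minimal Python sketch. The function names are hypothetical, and the sketch assumes nonnegative index values and strictly positive geometric parameters.

```python
import math

def transform_topological(value):
    """ln(index + 1): compresses indices whose scales differ by orders of
    magnitude, and stays defined when an index equals zero."""
    return math.log1p(value)

def transform_geometric(value):
    """ln(parameter), for strictly positive quantities such as a volume."""
    return math.log(value)

# Illustrative values only: a zero-valued index and indices of very different scale.
for raw in (0.0, 9.0, 999.0):
    print(raw, transform_topological(raw))
```

One practical reason for the "plus one" is that several indices (for example, cluster connectivity terms) can be exactly zero for molecules lacking the corresponding subgraph, where a plain logarithm would be undefined.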

Two regression procedures were used in developing the
linear models. When the number of independent variables
was high, typically greater than 25, a stepwise regression
procedure was used to maximize the improvement of the
explained variance (R^2). When the number of independent
variables was smaller, all possible subsets regression was
used. Models were then optimized to reduce problems of
variance inflation and collinearity. Regression modeling was
conducted using the REG procedure of the statistical package
SAS.
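
As a rough illustration of variable selection driven by improvement in R^2, the sketch below implements a plain greedy forward selection with NumPy. It is a simplified stand-in, not the SAS REG procedure: the stopping rule (`min_gain`), the function names, and the synthetic data are all assumptions made for the example.

```python
import numpy as np

def r_squared(X, y):
    """R^2 of an ordinary least-squares fit of y on the columns of X plus an intercept."""
    A = np.column_stack([np.ones(len(y)), X])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ coef
    return 1.0 - (resid @ resid) / np.sum((y - y.mean()) ** 2)

def forward_select(X, y, max_vars, min_gain=0.01):
    """At each step add the column that raises R^2 the most; stop when the
    gain falls below `min_gain` or `max_vars` columns have been chosen."""
    chosen, best = [], 0.0
    while len(chosen) < max_vars:
        scores = {j: r_squared(X[:, chosen + [j]], y)
                  for j in range(X.shape[1]) if j not in chosen}
        if not scores:
            break
        j, r2 = max(scores.items(), key=lambda item: item[1])
        if r2 - best < min_gain:
            break
        chosen.append(j)
        best = r2
    return chosen, best

# Synthetic data: only columns 1 and 4 carry signal.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 6))
y = 2.0 * X[:, 1] - 1.0 * X[:, 4] + 0.1 * rng.normal(size=200)

chosen, r2 = forward_select(X, y, max_vars=4)
print(chosen, round(float(r2), 3))
```

A production analysis would also check variance inflation and collinearity among the retained variables, as the text notes.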

The vapor pressure data (pvap) were split into a training set
(342 compounds) and a test set (134 compounds), an
approximately 75/25 split. Models were developed using
the training set of chemicals and then used to predict the
pvap values of the test chemicals. Final models were then
developed using the combined training and test set of
chemicals.

Five sets of indices were used in model development.
These sets were constructed as part of a hierarchical approach
to QSAR modeling. The hierarchy begins with the simplest
indices, the topostructural. After developing our initial model
utilizing the topostructural indices, we increase the level of
complexity. To the indices included in the best topostructural
model, we add all of the topochemical indices and proceed
to model pvap using these parameters. Likewise, the indices
included in the best model from this procedure are combined
with the geometrical indices and modeling is conducted once
again. In addition to this hierarchical approach, models were
also constructed using the topochemical indices alone and
the geometrical indices alone for purposes of comparison.
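
The tiered procedure can be sketched as a loop in which the variables retained at one level are pooled with the next class of descriptors before selection is repeated. The Python below is a hedged illustration on synthetic data: `select` is a simple greedy R^2-based chooser standing in for the regression procedures actually used, and the column groupings are invented.

```python
import numpy as np

def r_squared(X, y, cols):
    """R^2 of a least-squares fit of y on the given columns plus an intercept."""
    A = np.column_stack([np.ones(len(y))] + [X[:, j] for j in cols])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ coef
    return 1.0 - (resid @ resid) / np.sum((y - y.mean()) ** 2)

def select(X, y, candidates, min_gain=0.02):
    """Greedy forward selection among candidate columns (simplified stand-in)."""
    chosen, best = [], 0.0
    remaining = list(candidates)
    while remaining:
        r2, j = max((r_squared(X, y, chosen + [k]), k) for k in remaining)
        if r2 - best < min_gain:
            break
        chosen.append(j)
        remaining.remove(j)
        best = r2
    return chosen

def hierarchical_select(X, y, tiers):
    """Carry the variables retained at each tier forward into the next tier."""
    retained = []
    for tier_columns in tiers:
        retained = select(X, y, retained + list(tier_columns))
    return retained

# Synthetic descriptors: columns 0-1 play the role of topostructural indices,
# 2-3 topochemical, 4 geometrical. Only columns 0 and 2 carry signal.
rng = np.random.default_rng(1)
X = rng.normal(size=(2000, 5))
y = X[:, 0] + 2.0 * X[:, 2] + 0.1 * rng.normal(size=2000)

final = hierarchical_select(X, y, tiers=[[0, 1], [2, 3], [4]])
print(sorted(final))
```

In this toy case the last tier adds nothing, loosely mirroring the situation reported for the geometrical indices.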

3.  RESULTS 

Stepwise regression analyses for log10(pvap) of the training
set of chemicals are summarized in Table 3. As shown in
Table 3, the topostructural model using three parameters
resulted in an explained variance (R^2) of 48.1% and a
standard error (s) of 0.56. Addition of the topochemical
parameters to the three topostructural parameters led to a
significant increase in the effectiveness of the model. The
resulting model used 12 parameters, two topostructural and
ten topochemical. This model had an R^2 of 80.4% and s of
0.35. All subsets regression of the two topostructural and
ten topochemical indices retained thus far and the three
geometrical indices resulted in the selection of the same 12
parameter model; thus the geometrical indices did not
contribute significantly to model development. Several other
models were constructed for comparative purposes. Using
topochemical indices only, a ten parameter model was
developed which had an R^2 of 79.2% and s of 0.36. A
geometrical model was developed which utilized all three
geometrical indices and resulted in an R^2 of 51.8% and s of
0.53. Finally, two additional stepwise models were developed.
One model simply used all indices for a comparison
between a simple stepwise analysis of the data and the results
of the hierarchical procedure. This resulted in an 11
parameter model with R^2 of 79.6% and s of 0.35. The second
model added two new parameters, HB1 and μ. We thought
that it might be possible to improve our modeling by adding
in some other nonempirical parameters which could be
important to the determination of normal vapor pressure. We
selected the parameters HB1 and μ, since they would be
important in intermolecular interactions which could have a
dramatic effect on vapor pressure. To look at the addition
of these parameters, we conducted a stepwise regression
analysis using all topostructural, topochemical, and geometric
indices so that we would be able to optimize our model,
just as we had done with the previous models. The addition
of these parameters led to the selection of a ten parameter
model which included three topostructural indices, nine
topochemical indices, and HB1. This was the best model
yet, with an R^2 of 82.9% and s of 0.32.

Table 3. Summary of the Regression Results for the Training Set and the Prediction Results for the Test Set for the Hierarchical Analysis of
log10(pvap)

                                  training set (N = 342)      test set (N = 134)
parameter class                   F        R^2 (%)   s        R^2 (%)   s
topostructural                    104.6    48.1      0.56     57.9      0.46
topochemical                      126.3    79.3      0.36     85.8      0.37
geometrical                       168.9    51.8      0.53     62.3      0.44
topostructural + topochemical     112.3    80.4      0.35     84.7      0.38
all indices                       117.4    79.6      0.35     84.3      0.38
all indices + HB1                 160.8    82.9      0.32     83.1      0.39

Application of these six models to the test set of chemicals
resulted in comparable R^2 and s; in fact, all models improved
slightly on their predictions of the test set, and these values
are also listed in Table 3. Based on these results, we decided
that it was pointless to develop further models using only
geometrical parameters. Also, based on the findings that
the geometrical indices did not contribute significantly to
any of the training models, they were dropped from the
development of final models for the full set of 476 chemicals.
However, even though the topostructural indices did not
perform well in modeling vapor pressure by themselves, they
will be used in model development since they did contribute
significantly to most of the models.

Regression analyses of the combined set of 476 chemicals
showed similar results for estimating log10(pvap) as analysis
of the training set. Using only the topostructural indices,
stepwise regression analysis resulted in a five parameter
model to estimate vapor pressure:

$$\log_{10}(p_{vap}) = 4.88 + 0.20(O) - 2.56({}^{1}\chi) + 0.49([\text{illegible}]\,\chi_{C}) + 0.79([\text{illegible}]) + 0.98(P_{10}) \qquad (1)$$

n = 476, R^2 = 51.5%, s = 0.53, F = 99.7
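
The summary statistics reported with eq 1 can be cross-checked using the standard relation between R^2 and the overall F statistic of a linear regression with k predictors and n observations, F = (R^2/k) / ((1 - R^2)/(n - k - 1)). The short Python sketch below applies it to the rounded values quoted for eq 1; the function name is illustrative, and small differences from the reported F are expected because R^2 is given to only three figures.

```python
def f_statistic(r2, n, k):
    """Overall regression F for k predictors and n observations, computed from R^2."""
    return (r2 / k) / ((1.0 - r2) / (n - k - 1))

# Values quoted for eq 1: n = 476, R^2 = 51.5%, five parameters.
f_eq1 = f_statistic(0.515, n=476, k=5)
print(round(f_eq1, 1))  # close to the reported F = 99.7
```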

Stepwise regression using the five topostructural parameters
and all topochemical parameters resulted in the selection
of the following seven parameter model:

$$\log_{10}(p_{vap}) = 8.44 - 1.77([\text{illegible}]) + 1.25(P_{10}) - 5.69(IC_{1}) + 3.91(CIC_{2}) - 1.24(CIC_{5}) + 1.41([\text{illegible}]) - 1.70([\text{illegible}]) \qquad (2)$$

n = 476, R^2 = 79.3%, s = 0.34, F = 224.0

Only two of the topostructural indices used in eq 1 were
retained by the stepwise regression procedure used to produce
eq 2. The improvement in R^2 was significant,
increasing from 51.5% for eq 1 to 79.3% for eq 2. Also,
the model error decreased significantly, dropping by 0.19
logarithmic units. Since we have dropped the geometrical
indices, this becomes our final hierarchical model.

The stepwise regression analysis of only topochemical
parameters resulted in a 12 parameter model:

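$$\log_{10}(p_{vap}) = 6.65 - 3.44(CIC_{[\text{illegible}]}) - 1.33(CIC_{5}) - 3.47(SIC_{2}) + 0.87(CIC_{1}) - 0.48([\text{illegible}]) + 1.44([\text{illegible}]) - 1.00([\text{illegible}]) - 0.41([\text{illegible}]) - 0.70([\text{illegible}]) - 1.08([\text{illegible}]) + 1.42([\text{illegible}]) - 1.23([\text{illegible}]) \qquad (3)$$

n = 476, R^2 = 75.8%, s = 0.38, F = 120.5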

This model is inferior to the topostructural +
topochemical model (eq 2), because its variance explained
is lower and, more importantly, it requires more independent
variables (parameters) to achieve this explanation of variance.

Stepwise regression of all indices resulted in the selection
of an 11 parameter model. This approach selected three
topostructural indices and eight topochemical indices to arrive
at the following model:

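$$\log_{10}(p_{vap}) = 7.85 - 2.56([\text{illegible}]) + 1.17([\text{illegible}]) - 5.01(IC_{1}) + 3.65(CIC_{2}) - 0.99(CIC_{5}) + 0.51(CIC_{1}) - 1.54([\text{illegible}]) - 0.36([\text{illegible}]) - 0.36([\text{illegible}]) - 1.40([\text{illegible}]) \qquad (4)$$

n = 476, R^2 = 80.4%, s = 0.33, F = 173.4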


654 J. Chem. Inf. Comput. Sci., Vol. 37, No. 4, 1997

Figure 1. Scatterplot of observed log10(pvap) vs estimated log10(pvap)
using eq 5 for 476 diverse compounds.


While eq 4 shows some slight improvements over eq 2,
the hierarchical model, eq 2 is preferred since it is a simpler
model using seven indices instead of 11 and, based on a
comparison of F values, it is a more robust model than that
in eq 4.

Finally, we conducted the stepwise regression modeling
using all topostructural and topochemical indices with HB1
and μ for the complete set of 476 chemicals. The resulting
ten parameter model used three topostructural indices, six
topochemical indices, and HB1.

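$$\log_{10}(p_{vap}) = 9.67 - 3.66({}^{1}\chi) + 0.35(P_{[\text{illegible}]}) + [\text{illegible}](P_{[\text{illegible}]}) - 1.78(CIC_{0}) - 3.33(SIC_{1}) - [\text{illegible}](CIC_{[\text{illegible}]}) + 2.05([\text{illegible}]) - 1.73([\text{illegible}]) - 0.79([\text{illegible}]) - 0.29(HB_{1}) \qquad (5)$$

n = 476, R^2 = 84.3%, s = 0.29, F = 249.5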

Equation 5 shows marked improvement over eq 2,
justifying the addition of indices to the model. Also, it meets
the criteria on which eq 4 was judged to be lacking. Overall,
there is an improvement in variance explained of 5%, with
a comparable decrease in standard deviation. A scatter plot
of observed log10(pvap) versus estimated log10(pvap) using eq
5 is presented in Figure 1.

4.  DISCUSSION 

The purpose of this paper was 2-fold: (a) to study the
utility of algorithmically derived molecular descriptors in
developing QSAR models for predicting the vapor pressure
of chemicals from structure and (b) to investigate the relative
roles of topostructural, topochemical, and geometrical indices
in the estimation of standard vapor pressure.

BASAK ET AL.

Table 4. Summary of the Chemical Class Composition of the
Normal Vapor Pressure Dataset

compd classification                   no. of compds   pure   substituted
total normal vapor pressure dataset    476
hydrocarbons                           253
non-hydrocarbons (a)                   223
nitrocompounds                         4               3      1
amines                                 20              17     3
nitriles                               7               6      1
ketones                                7               7      0
halogens                               100             95     5
anhydrides                             1               1      0
esters                                 18              16     2
carboxylic acids                       2               2      0
alcohols                               10              6      4
sulfides                               39              38     1
thiols                                 4               4      0
[illegible]                            2               2      0
epoxides                               1               1      0
aromatic compounds (b)                 15              10     4
fused-ring compounds (c)               1               1      0

(a) The non-hydrocarbons are further broken down into the following
groups. (b) The 15 aromatic compounds are a mixture of 11 aromatic
hydrocarbons and four aromatic halides. (c) The only fused-ring
compound was a polycyclic aromatic hydrocarbon.

Results described in this paper (eqs 1-5) show that
nonempirical parameters derived predominantly from graph
theoretic models of molecules can estimate normal vapor
pressure of diverse chemicals reasonably well. The explained
variance of data (R^2 = 84.3%) is excellent in view
of the fact that the database of chemicals analyzed in this
paper is very diverse (see Table 4). It should be mentioned
that most published QSAR models for the estimation of vapor
pressure have dealt with much smaller data sets with limited
structural variety.

The relative effectiveness of topostructural, topochemical,
and geometrical indices in predicting normal vapor pressure
of chemicals is evident from the results presented above.
Equation 1 explains over 51% of variance in the data. All
parameters used to derive eq 1 are topostructural, i.e., they
are parameters which encode information about the adjacency
and distance of vertices in skeletal molecular graphs without
quantifying any explicit information about such chemical
aspects as bond order, electronic character of atoms, etc.
Yet, the high explained variance of the property indicates
that adjacency and distance in chemical graphs, being general
descriptors of molecular size, shape, and branching, are
important in predicting properties. This may explain the
success of parameters like simple connectivity indices in
estimating many diverse properties.

Equation 3 is derived only from topochemical indices. The
explained variance of vapor pressure (75.8%) shows that
topochemical parameters, as a class, explain a larger fraction
of the variance as compared to models derived from only
topostructural indices (eq 1). Geometrical parameters were
dropped from the set of descriptors after their limited success
in prediction for the training and test sets. This is in line
with our earlier studies with normal boiling point and
hydrophobicity, where it was reported that the addition of
geometrical indices could not significantly improve the
predictive power of QSAR models derived from a combined
set of topostructural and topochemical parameters. It would
be interesting to see whether this pattern holds good for other
properties as well. Finally, the addition of the simple
nonempirical parameter, HB1, which contains information
relevant to intermolecular interactions, further improves our
ability to estimate normal vapor pressure, resulting in an
explained variance of 84.3% (eq 5).
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ABSTRACT 


Four  classes  of  theoretical  structural  parameters,  viz.,  topostructural,  topochemical, 
geometrical  and  quantum  chemical  descriptors,  have  been  used  in  the  development  of 
quantitative  structure-activity  relationship  (QSAR)  models  for  a  set  of  sixty-nine 
benzene  derivatives.  None  of  the  individual  classes  of  parameters  was  very  effective  in 
predicting  toxicity.  A  hierarchical  approach  was  followed  in  using  a  combination  of  the 
four  classes  of  indices  in  QSAR  model  development.  The  results  show  that  the 
hierarchical  QSAR  approach  using  the  algorithmically  derived  molecular  descriptors  can 
estimate the LC50 values of the benzene derivatives reasonably well.

KEYWORDS 
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INTRODUCTION 


Today's  toxicologist  is  faced  with  a  myriad  of  unknowns.  In  1996  approximately  1.26 
million  new  chemicals  were  registered  with  the  Chemical  Abstract  Service  (CAS), 
bringing  the  total  number  of  registered  chemicals  to  around  15.8  million  [1].  With  such  a 
large  number  of  chemicals  being  registered  yearly,  it  is  impossible  to  test  all  of  them 
exhaustively  for  their  effects  on  the  environment  and  human  health.  Chemicals  can  only 
be  evaluated  as  they  are  called  into  question,  and  for  many  of  these  compounds  there 
will  be  little  or  no  test  data  available.  Therefore,  when  the  issue  of  hazard  assessment 
comes  up,  it  becomes  difficult  at  best  to  provide  any  useful  suggestions  or  analyses  for 
many  of  the  registered  chemicals,  including  some  which  are  in  commerce  today.  To 
complete  the  battery  of  tests  necessary  for  the  proper  hazard  assessment  of  a  single 
compound  is  an  extremely  costly  procedure  and  there  is  simply  not  enough  time  or 
money  to  complete  these  test  batteries  for  all  compounds  which  are  registered  today 
[2].  As  a  result,  when  we  need  to  evaluate  the  human  health  or  ecological  hazards 
posed by a chemical, it becomes ever more important that we have accurate methods
for  estimating  the  physicochemical  and  biological  properties  of  molecules. 

Quantitative  structure-activity  relationships  (QSARs)  have  come  into  widespread 
use  for  the  prediction  of  various  molecular  properties  and  biological  responses. 
Traditional QSARs use empirical properties (e.g., boiling point, melting point,
octanol-water partition coefficient) or empirically derived parameters (e.g., linear free
energy related (LFER) and linear solvation energy related (LSER) parameters) for the
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prediction  of  other  endpoints  [3-8].  However,  due  to  the  scarcity  of  available  data  for  the 
majority  of  chemicals  that  need  to  be  evaluated  for  ecotoxicological  risk  assessment, 
these  physicochemical  properties  necessary  for  traditional  QSAR  model  development 
may  not  be  known.  When  this  is  the  case,  it  is  imperative  that  we  have  methods  that 
make  use  of  nonempirical  parameters.  One  of  the  fundamental  principles  of 
biochemistry  is  that  activity  is  dictated  by  structure  [9].  Following  this  principle,  one  can 
use  theoretical  molecular  descriptors  which  quantify  structural  aspects  of  the  molecular 
structure [10-27]. These theoretical descriptors can be generated directly from the
molecular  structure  alone,  without  any  input  of  experimental  data. 

Topological  indices  (TIs)  are  numerical  graph  invariants  that  quantify  certain 
aspects  of  molecular  structure.  TIs  are  sensitive  to  such  structural  features  as  size, 
shape,  bond  order,  branching,  and  neighborhood  patterns  of  atoms  in  molecules.  They 
can  be  derived  from  simple  linear  graphs,  multigraphs,  weighted  graphs,  and  weighted 
pseudographs.  TIs  derived  from  these  different  classes  of  graphs  will  encode  different 
types  of  information  about  molecular  architecture.  The  different  classes  of  TIs  provide 
us  with  nonempirical,  quantitative  descriptors  that  can  be  used  in  place  of 
experimentally  derived  descriptors  in  QSARs  for  the  prediction  of  properties. 
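
As a concrete instance of a graph invariant of this kind, the first-order (Randić) connectivity index sums (d_i d_j)^(-1/2) over the edges of the hydrogen-suppressed graph, where d_i is the degree of vertex i. The Python sketch below is illustrative only; the function name and edge-list input format are assumptions, and the two carbon skeletons are textbook examples rather than compounds from the data set.

```python
import math

def randic_index(edges):
    """First-order connectivity index: sum over edges of 1/sqrt(deg(a) * deg(b))."""
    degree = {}
    for a, b in edges:
        degree[a] = degree.get(a, 0) + 1
        degree[b] = degree.get(b, 0) + 1
    return sum(1.0 / math.sqrt(degree[a] * degree[b]) for a, b in edges)

# Hydrogen-suppressed carbon skeletons given as edge lists.
n_butane = [(0, 1), (1, 2), (2, 3)]
isobutane = [(0, 1), (0, 2), (0, 3)]

print(randic_index(n_butane))   # 1/2 + sqrt(2), about 1.914
print(randic_index(isobutane))  # sqrt(3), about 1.732
```

The branched isomer receives the smaller value, showing how the index responds to branching without using any experimental input.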

Our recent studies have focused on the role of different classes of theoretical
descriptors of increasing levels of complexity and their utility in QSAR [28-31]. This
takes the form of a hierarchical approach which examines the relative contributions of
parameters of gradually increasing complexity (e.g., structural, chemical, shape, and
quantum chemical descriptors) in estimating physicochemical and biological properties.
In this paper we report the utility of this hierarchical approach in modeling the
acute aquatic toxicity (LC50) of a congeneric set of sixty-nine benzene derivatives.

THEORETICAL  METHODS 

Database 


Acute aquatic toxicity [-log(LC50)] data for fathead minnow (Pimephales promelas) were
taken from the work of Hall, Kier and Phipps [32]. Their data were compiled from eight
other  sources,  as  well  as  some  original  work  which  was  conducted  at  the  U.S. 
Environmental  Protection  Agency  (USEPA)  Environmental  Research  Laboratory  in 
Duluth,  Minnesota.  The  complete  set  of  fathead  minnow  data  included  69  benzene 
derivatives.  According  to  the  authors,  the  set  of  benzene  derivatives  were  tested  using 
methodologies  which  were  comparable  to  their  96-hour  fathead  minnow  toxicity  test 
system.  The  derivatives  chosen  for  this  study  have  seven  different  substituent  groups 
that  are  all  present  in  at  least  six  of  the  molecules.  These  groups  consist  of  chloro, 
bromo,  nitro,  methyl,  methoxyl,  hydroxyl,  and  amino  substituents  (Table  I). 


Computation  of  Indices 

Four distinct sets of theoretical descriptors have been used in this study. These sets
include topostructural, topochemical, geometric, and quantum chemical indices. The
topostructural and topochemical indices fall into the category normally grouped together
as topological indices. The geometrical indices are the three-dimensional Wiener numbers
for the hydrogen-filled and hydrogen-suppressed molecular structures and the
van der Waals volume.

Topostructural indices (TSIs) are topological indices which only encode
information about the adjacency and distances of atoms (vertices) in molecular
structures (graphs), irrespective of the chemical nature of the atoms involved in bonding
or factors such as hybridization states and the number of core/valence electrons in
individual atoms. Topochemical indices (TCIs) are parameters that quantify information
regarding  the  topology  (connectivity  of  atoms),  as  well  as  specific  chemical  properties  of 
the  atoms  comprising  a  molecule.  These  indices  are  derived  from  weighted  molecular 
graphs  where  each  vertex  (atom)  or  edge  (bond)  is  properly  weighted  with  selected 
chemical  or  physical  property  information.  The  sets  of  topostructural  and  topochemical 
indices  are  shown  in  Table  II. 

Topological  Indices 

The  102  topological  indices  used  in  this  study,  both  the  topostructural  and  the 
topochemical,  have  been  calculated  by  POLLY  2.3  [33]  and  software  developed  by  the 
authors.  These  indices  include  Wiener  index  [34],  connectivity  indices  developed  by 
Randic  [35]  and  higher  order  connectivity  indices  formulated  by  Kier  and  Hall  [36], 
bonding connectivity indices defined by Basak et al. [37], a set of information theoretic
indices defined on the distance matrices of simple molecular graphs [38,39] and
neighborhood complexity indices of hydrogen-filled molecular graphs [40,41], and
Balaban's J indices [42-44]. Table III provides the list of the topostructural,
topochemical, and geometrical indices included in this study.
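As an illustration of how a topostructural index is obtained from nothing but adjacency information, the following minimal sketch computes the Wiener index (the half-sum of the off-diagonal elements of the topological distance matrix) for hydrogen-suppressed graphs. It is not the POLLY implementation; the function names and the adjacency-dictionary layout are assumptions made for this example.

```python
from collections import deque

def distance_matrix(adjacency):
    """All-pairs topological distances (number of bonds) by breadth-first search."""
    n = len(adjacency)
    dist = [[0] * n for _ in range(n)]
    for source in range(n):
        seen = {source: 0}
        queue = deque([source])
        while queue:
            u = queue.popleft()
            for v in adjacency[u]:
                if v not in seen:
                    seen[v] = seen[u] + 1
                    queue.append(v)
        for target, d in seen.items():
            dist[source][target] = d
    return dist

def wiener_index(adjacency):
    """Sum of distances over all unordered vertex pairs (upper triangle)."""
    dist = distance_matrix(adjacency)
    n = len(dist)
    return sum(dist[i][j] for i in range(n) for j in range(i + 1, n))

# Hydrogen-suppressed graphs: vertices 0-5 are ring carbons, vertex 6 is the methyl carbon.
benzene = {i: [(i - 1) % 6, (i + 1) % 6] for i in range(6)}
toluene = {**benzene, 0: [5, 1, 6], 6: [0]}

print(wiener_index(benzene), wiener_index(toluene))
```

Because only connectivity enters the calculation, any two ring substituents placed at the same position give the same value, which is why topochemical weighting is needed to distinguish them.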

Geometrical  Indices 

Van der Waals volume, Vw [45-47], was calculated using Sybyl 6.1 from Tripos
Associates, Inc. [48]. The 3-D Wiener numbers were calculated by Sybyl using an SPL
(Sybyl Programming Language) program developed in our lab [49]. The 3-D
Wiener number is the sum of the entries in the upper triangular submatrix of the
topographic Euclidean distance matrix for a molecule. The 3-D coordinates for the
atoms were determined using CONCORD 3.0.1 [50]. Two variants of the 3-D Wiener
number were calculated: 3DW_H and 3DW. For 3DW_H, hydrogen atoms are included in the
computations, and for 3DW, hydrogen atoms are excluded from the computations.
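The definition above can be sketched in a few lines. The example below is not the SPL program used in the study; it assumes an idealized planar benzene geometry (C-C = 1.40 angstroms, C-H = 1.09 angstroms) in place of CONCORD coordinates, and the helper names are hypothetical.

```python
import math

def wiener_3d(coordinates):
    """Sum of Euclidean distances over all unordered atom pairs (upper triangle)."""
    n = len(coordinates)
    return sum(math.dist(coordinates[i], coordinates[j])
               for i in range(n) for j in range(i + 1, n))

def ring(radius):
    """Six points on a planar regular hexagon of the given circumradius."""
    return [(radius * math.cos(math.radians(60 * k)),
             radius * math.sin(math.radians(60 * k)), 0.0) for k in range(6)]

carbons = ring(1.40)      # assumed C-C bond length of 1.40 angstroms
hydrogens = ring(2.49)    # carbons pushed out by an assumed C-H length of 1.09 angstroms

w_suppressed = wiener_3d(carbons)             # hydrogen atoms excluded
w_filled = wiener_3d(carbons + hydrogens)     # hydrogen atoms included
print(w_suppressed, w_filled)
```

The hydrogen-filled value is always the larger of the two, since it adds strictly positive C-H and H-H distances to the carbon-only sum.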

Quantum  Chemical  Parameters 

The following quantum chemical parameters were calculated using the Austin Model
version one (AM1) semi-empirical Hamiltonian: energy of the highest occupied
molecular orbital (E_HOMO), energy of the second highest occupied molecular orbital
(E_HOMO1), energy of the lowest unoccupied molecular orbital (E_LUMO), energy of the second
lowest unoccupied molecular orbital (E_LUMO1), heat of formation (ΔHf), and dipole moment
(μ). These parameters were calculated using MOPAC 6.00 in the SYBYL interface [51].

Data  Reduction 

Initially,  all  topological  indices  were  transformed  by  the  natural  logarithm  of  the  index 
plus  one.  This  was  done  to  scale  the  indices,  since  some  may  be  several  orders  of 
magnitude  greater  than  others,  while  other  indices  may  equal  zero.  The  geometric 
indices were transformed by the natural logarithm of the index for consistency; the
addition of one was unnecessary because these indices are always positive.
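The two scaling rules can be written out directly. This is only a sketch of the transformation described above; the function names are hypothetical.

```python
import math

def scale_topological(value):
    """ln(index + 1): defined even when a topological index equals zero."""
    return math.log1p(value)

def scale_geometric(value):
    """ln(index): geometric indices are strictly positive, so no offset is needed."""
    return math.log(value)

# A path count of zero maps to zero; large indices are compressed to a similar range.
print(scale_topological(0), scale_topological(1000), scale_geometric(180.0))
```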

The  set  of  eighty-one  topological  indices  was  then  partitioned  into  two  distinct 
sets,  the  topostructural  indices  (thirty-four)  and  the  topochemical  indices  (forty-seven). 
To  further  reduce  the  number  of  independent  variables  for  model  construction,  the  sets 
of  topostructural  and  topochemical  indices  were  further  divided  into  subsets,  or  clusters, 
based  on  the  correlation  matrix  using  the  SAS  procedure  VARCLUS  [52].  This 
procedure  divides  the  set  of  indices  into  disjoint  clusters,  such  that  each  cluster  is 
essentially  unidimensional. 

From each cluster we selected the index most correlated with the cluster, as well
as any indices which were poorly correlated with their cluster (r² < 0.70). These indices
were then used in the modeling of the acute aquatic toxicity of benzene derivatives in
fathead minnow. The variable clustering and selection of indices was performed
independently for both the topostructural and topochemical indices. This procedure
resulted  in  a  set  of  five  topostructural  indices  and  a  set  of  nine  topochemical  indices. 
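The selection rule applied to each cluster can be sketched as follows. This is not SAS VARCLUS: the cluster memberships are taken as given, and the cluster component is assumed here to be the first principal component of the standardized member indices. The function name and the synthetic data are hypothetical.

```python
import numpy as np

def select_representatives(X, clusters, threshold=0.70):
    """Per cluster, keep the best-correlated index plus any poorly correlated ones.

    X        : (n_compounds, n_indices) array of transformed index values
    clusters : list of lists of column numbers from a prior clustering step
    """
    # Standardize columns; constant columns are assumed to have been removed already.
    Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
    selected = []
    for members in clusters:
        block = Z[:, members]
        # Assumption: cluster component = first principal component of the members.
        _, _, vt = np.linalg.svd(block, full_matrices=False)
        component = block @ vt[0]
        r2 = np.array([np.corrcoef(block[:, k], component)[0, 1] ** 2
                       for k in range(len(members))])
        keep = {members[int(np.argmax(r2))]}
        keep |= {m for m, value in zip(members, r2) if value < threshold}
        selected.extend(sorted(keep))
    return selected

# Synthetic check: columns 0 and 1 are nearly identical, column 2 is unrelated.
rng = np.random.default_rng(0)
a = rng.normal(size=50)
X = np.column_stack([a, a + 0.01 * rng.normal(size=50), rng.normal(size=50)])
print(select_representatives(X, [[0, 1, 2]]))
```

In the synthetic case one of the two redundant columns is kept as the representative, and the unrelated column is retained because its squared correlation with the cluster component falls below 0.70.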

Reducing the number of independent variables is critical when attempting to
model small datasets. The smaller the dataset is, the greater the chance of spurious
correlations when using a large number of independent variables (descriptors). Topliss and
Edwards have studied this issue of chance correlations [53]. For a set with about
seventy observations of the dependent variable, to keep the probability of chance
correlations less than 0.01, we can use at most forty independent variables. This
number is dependent on the actual correlation achieved in the modeling process: with a
high correlation we have a better chance of using more variables with the same limited
probability of chance correlations. In this study we are well below the cut-off of forty. In
fact, the total number of descriptors which will be used for model construction and
estimation is twenty-three, well within the bounds of the Topliss and Edwards criteria
[53].

Statistical  Analysis  and  Hierarchical  QSAR 

Regression  modeling  was  accomplished  using  the  SAS  procedure  REG  on  seven 
distinct  sets  of  indices.  These  sets  were  constructed  as  part  of  a  hierarchical  approach 
to  QSAR  model  development.  The  hierarchy  begins  with  the  simplest  parameters,  the 
TSIs.  After  using  the  TSIs  to  model  the  activity,  the  next  level  of  complexity  is  added. 
To the indices included in the best TSI model, we add all of the TCIs and proceed to
model the activity using these parameters. Likewise, the indices included in the best
model from this procedure are combined with the indices from the next level, the
geometrical indices, and modeling is conducted once again. Finally, the best model
utilizing TSIs, TCIs and geometrical indices is combined with the quantum chemical
parameters. The regression analysis results in the final selection of indices for each of
the models. The remaining three models, which use TCIs, geometric, and quantum
chemical parameters independently, serve as a means of validating the utility of the
hierarchical approach and the need for varying types of theoretical descriptors.
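The tier-by-tier procedure can be sketched as below. This is not the SAS REG workflow used in the study: the subset-ranking criterion is not stated in the text, so adjusted R² is assumed here, and all function names and the synthetic data are hypothetical.

```python
from itertools import combinations
import numpy as np

def fit_ols(X, y):
    """Least-squares fit with intercept; returns (R^2, standard error s)."""
    n, p = X.shape
    A = np.column_stack([np.ones(n), X])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    residuals = y - A @ beta
    sse = float(residuals @ residuals)
    sst = float(((y - y.mean()) ** 2).sum())
    return 1.0 - sse / sst, (sse / (n - p - 1)) ** 0.5

def best_subset(X, y, candidates):
    """All-possible-subsets search; ranking by adjusted R^2 is an assumption."""
    n = len(y)
    best_score, best_columns = -np.inf, []
    for k in range(1, len(candidates) + 1):
        for subset in combinations(candidates, k):
            r2, _ = fit_ols(X[:, list(subset)], y)
            adjusted = 1.0 - (1.0 - r2) * (n - 1) / (n - k - 1)
            if adjusted > best_score:
                best_score, best_columns = adjusted, list(subset)
    return best_columns

def hierarchical_qsar(X, y, tiers):
    """tiers: lists of column numbers, simplest descriptor class first."""
    carried, history = [], []
    for tier in tiers:
        # Indices kept from the previous tier compete with all indices of the new tier.
        pool = carried + [c for c in tier if c not in carried]
        carried = best_subset(X, y, pool)
        r2, s = fit_ols(X[:, carried], y)
        history.append((carried, r2, s))
    return history

# Synthetic data: the response depends on one column from each of two tiers.
rng = np.random.default_rng(1)
X = rng.normal(size=(40, 4))
y = 2.0 * X[:, 0] - X[:, 3] + 0.01 * rng.normal(size=40)
for columns, r2, s in hierarchical_qsar(X, y, tiers=[[0, 1], [2, 3]]):
    print(columns, round(r2, 3), round(s, 3))
```

An exhaustive search is feasible here because each tier offers only a handful of candidates; with thirteen candidates, as in the second tier of this study, there are 8191 subsets to fit.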


RESULTS 


The variable clustering of the topostructural indices resulted in the retention of five
indices: M1, IC, O, Pa, Pg. All-possible subsets regression resulted in the selection of a
four-parameter model to estimate -log(LC50) with an explained variance (R²) of 45.3%
and a standard error (s) of 0.58. While this is an unsatisfactory model, the indices will
still be retained and combined with the topochemical indices in the second step of
model development. Table IV lists the indices used in each of the models.

The second step of the hierarchical method combined the four indices used in
the first tier model with the nine topochemical indices selected in the variable clustering
procedure: SIC0, SIC1, SIC4, CIC0, V> Vci Vc. Vpc. Again, all-possible subsets
regression was conducted, resulting in a four-parameter model with an explained
variance (R²) of 78.3% and a standard error (s) of 0.36. While this model retained two
parameters from the topostructural model, it is evident that the addition of two
topochemical indices made a significant contribution to the effectiveness of our model.

The four indices from the second tier model were then combined with the three
geometric parameters: 3DW_H, 3DW, and Vw. The resulting model from this procedure retained
four indices, replacing the topochemical index CIC0 with the geometric parameter 3DW_H.
This model had an explained variance (R²) of 79.2% and a standard error (s) of 0.36.

The final step in the hierarchical method combined the four parameters from the
third tier model with the quantum chemical (AM1) parameters: E_HOMO, E_HOMO1, E_LUMO,
E_LUMO1, ΔHf, and μ. This set of ten indices led to a seven-parameter model with an explained
variance (R²) of 86.3% and a standard error (s) of 0.30. This model retained all of the
indices from the third model and added three quantum chemical parameters.

Three  other  models  were  constructed  for  the  purpose  of  comparison.  These 
include a five-parameter topochemical model, a three-parameter geometric model, and
a  four-parameter  quantum  chemical  model.  The  indices  used  in  these  models  and  the 
results  of  the  models  can  be  found  in  Table  IV. 

DISCUSSION 

The goal of this paper was to investigate the utility of hierarchical QSAR using
algorithmically derived molecular descriptors in predicting LC50 values for a set of
sixty-nine benzene derivatives. To this end, we used four classes of parameters, viz.,
topostructural descriptors, topochemical indices, geometrical descriptors and
semi-empirical quantum chemical indices.

It is clear from the results described in Table IV that none of the individual
classes of parameters correlates well with acute aquatic toxicity. The TSIs, the simplest
of the four classes of parameters, explained about 45% of the variance in toxicity. The
inclusion of topochemical indices in the set of independent variables made a substantial
improvement in the predictive capacity of the QSAR models. This is understandable
since the benzene derivatives analyzed in this paper comprise a fairly congeneric set,
and while the number and size of substituents may be important, the chemical nature of
the substituents also plays an important role in determining the overall toxicity of the
molecule. This is shown by the dramatic increase in predictive power between
equations 1 and 2. Equation 2 replaces two TSI descriptors with two TCI indices that
are sensitive to the atom types in all zero-order neighborhoods. The addition of this
basic chemical information results in an improvement in the model. A similar
conclusion is borne out from the QSAR analysis of the same set of benzene derivatives
reported by Hall et al., where they found that the chemical nature of the substituent is
important in determining toxicity [32].

In the next tier, equation 3 replaces one of the information content indices with
the three-dimensional Wiener number, a descriptor that characterizes the
three-dimensional aspects of molecular shape and size. This leads to refinement of the model
developed in equation 2. Finally, the addition of the quantum chemical parameters
(energy of the second lowest unoccupied molecular orbital, heat of formation, and
dipole moment) leads to a marked improvement in the predictive power of the model
(equation 4).

As can be seen from equations 1 and 5-7 (Table IV), none of the four classes of
indices does very well individually. The hierarchical QSAR approach using four classes of
parameters resulted in acceptable predictive models (equation 4). We may conclude
from the results presented in this paper that each of the four classes of theoretical
descriptors that were used is necessary for the development of good QSARs for the
acute aquatic toxicity of benzene derivatives in fathead minnow.

Table I. Sixty-nine benzene derivatives and their fathead minnow toxicities, expressed as
-log(LC50).

No. Compound                      -log(LC50)  -log(LC50)    Residual
                                  (obs.)      (est. eq. 4)

 1  Benzene                       3.40        3.42          -0.02
 2  Bromobenzene                  3.89        3.77           0.12
 3  Chlorobenzene                 3.77        3.75           0.02
 4  Phenol                        3.51        3.38           0.13
 5  Toluene                       3.32        3.66          -0.34
 6  1,2-dichlorobenzene           4.40        4.29           0.11
 7  1,3-dichlorobenzene           4.30        4.37          -0.07
 8  1,4-dichlorobenzene           4.62        4.51           0.11
 9  2-chlorophenol                4.02        3.79           0.23
10  3-chlorotoluene               3.84        3.88          -0.04
11  4-chlorotoluene               4.33        3.87           0.46
12  1,3-dihydroxybenzene          3.04        3.43          -0.39
13  3-hydroxyanisole              3.21        3.33          -0.12
14  2-methylphenol                3.77        3.64           0.13
15  3-methylphenol                3.29        3.60          -0.31
16  4-methylphenol                3.58        3.53           0.05
17  4-nitrophenol                 3.36        3.61          -0.25
18  1,4-dimethoxybenzene          3.07        3.28          -0.21
19  1,2-dimethylbenzene           3.48        3.93          -0.45
20  1,4-dimethylbenzene           4.21        3.87           0.34
21  2-nitrotoluene                3.57        3.66          -0.09
22  3-nitrotoluene                3.63        3.53           0.10
23  4-nitrotoluene                3.76        3.49           0.27
24  1,2-dinitrobenzene            5.45        5.24           0.21
25  1,3-dinitrobenzene            4.38        4.18           0.20
26  1,4-dinitrobenzene            5.22        4.94           0.28
27  2-methyl-3-nitroaniline       3.48        3.79          -0.31
28  2-methyl-4-nitroaniline       3.24        3.51          -0.27
29  2-methyl-5-nitroaniline       3.35        3.68          -0.33
30  2-methyl-6-nitroaniline       3.80        3.84          -0.04
31  3-methyl-6-nitroaniline       3.80        3.78           0.02
32  4-methyl-2-nitroaniline       3.79        3.80          -0.01
33  4-hydroxy-3-nitroaniline      3.65        3.61           0.04
34  4-methyl-3-nitroaniline       3.77        3.73           0.04
35  1,2,3-trichlorobenzene        4.89        4.89          -0.00
36  1,2,4-trichlorobenzene        5.00        5.04          -0.04
37  1,3,5-trichlorobenzene        4.74        5.11          -0.37
38  2,4-dichlorophenol            4.30        4.33          -0.03
39  3,4-dichlorotoluene           4.74        4.26           0.48
40  2,4-dichlorotoluene           4.54        4.36           0.18
41  4-chloro-3-methylphenol       4.27        3.87           0.40
42  2,4-dimethylphenol            3.86        3.76           0.10
43  2,6-dimethylphenol            3.75        3.80          -0.05
44  3,4-dimethylphenol            3.90        3.80           0.10
45  2,4-dinitrophenol             4.04        4.14          -0.10
46  1,2,4-trimethylbenzene        4.21        4.09           0.12
47  2,3-dinitrotoluene            5.01        5.20          -0.19
48  2,4-dinitrotoluene            3.75        4.10          -0.35
49  2,5-dinitrotoluene            5.15        4.84           0.31
50  2,6-dinitrotoluene            3.99        4.41          -0.42
51  3,4-dinitrotoluene            5.08        5.11          -0.03
52  3,5-dinitrotoluene            3.91        4.05          -0.14
53  1,3,5-trinitrobenzene         5.29        5.37          -0.08
54  2-methyl-3,5-dinitroaniline   4.12        4.13          -0.01
55  2-methyl-3,6-dinitroaniline   5.34        4.80           0.54
56  3-methyl-2,4-dinitroaniline   4.26        4.28          -0.02
57  5-methyl-2,4-dinitroaniline   4.92        4.14           0.78
58  4-methyl-2,6-dinitroaniline   4.21        4.67          -0.46
59  5-methyl-2,6-dinitroaniline   4.18        4.80          -0.62
60  4-methyl-3,5-dinitroaniline   4.46        4.34           0.12
61  2,4,6-tribromophenol          4.70        4.89          -0.19
62  1,2,3,4-tetrachlorobenzene    5.43        5.62          -0.19
63  1,2,4,5-tetrachlorobenzene    5.85        5.80           0.05
64  2,4,6-trichlorophenol         4.33        4.79          -0.46
65  2-methyl-4,6-dinitrophenol    5.00        4.21           0.79
66  2,3,6-trinitrotoluene         6.37        6.36           0.01
67  2,4,6-trinitrotoluene         4.88        5.16          -0.28
68  2,3,4,5-tetrachlorophenol     5.72        5.36           0.36
69  2,3,4,5,6-pentachlorophenol   6.06        6.03           0.03

Table II. Symbols and definitions of topological and geometrical parameters.

I_D^W     Information index for the magnitudes of distances between all possible
          pairs of vertices of a graph
Ī_D^W     Mean information index for the magnitude of distance
W         Wiener index = half-sum of the off-diagonal elements of the distance matrix
          of a graph
I^D       Degree complexity
H^V       Graph vertex complexity
H^D       Graph distance complexity
IC        Information content of the distance matrix partitioned by frequency of
          occurrences of distance h
I_ORB     Information content or complexity of the hydrogen-suppressed graph at its
          maximum neighborhood of vertices
O         Order of neighborhood when ICr reaches its maximum value for the
          hydrogen-filled graph
M1        A Zagreb group parameter = sum of square of degree over all vertices
M2        A Zagreb group parameter = sum of cross-product of degrees over all
          neighboring (connected) vertices
ICr       Mean information content or complexity of a graph based on the rth (r = 0-5)
          order neighborhood of vertices in a hydrogen-filled graph
SICr      Structural information content for rth (r = 0-5) order neighborhood of vertices
          in a hydrogen-filled graph
CICr      Complementary information content for rth (r = 0-5) order neighborhood of
          vertices in a hydrogen-filled graph
hχ        Path connectivity index of order h = 0-6
hχC       Cluster connectivity index of order h = 3, 5
hχCh      Chain connectivity index of order h = 6
hχPC      Path-cluster connectivity index of order h = 4-6
hχb       Bond path connectivity index of order h = 0-6
hχbC      Bond cluster connectivity index of order h = 3, 5
hχbCh     Bond chain connectivity index of order h = 6
hχbPC     Bond path-cluster connectivity index of order h = 4-6
hχv       Valence path connectivity index of order h = 0-6
hχvC      Valence cluster connectivity index of order h = 3, 5
hχvPC     Valence path-cluster connectivity index of order h = 4-6
Ph        Number of paths of length h = 1-9
J         Balaban's J index based on distance
JB        Balaban's J index based on bond types
JX        Balaban's J index based on relative electronegativities
JY        Balaban's J index based on relative covalent radii
Vw        van der Waals volume
3DW       3-D Wiener number for the hydrogen-suppressed geometric distance
          matrix
3DW_H     3-D Wiener number for the hydrogen-filled geometric distance matrix
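The two Zagreb group parameters defined in Table II can be computed from an edge list alone. The sketch below is illustrative rather than the POLLY implementation; the function name and the edge-list layout are assumptions.

```python
def zagreb_indices(edges):
    """Return (M1, M2) for a hydrogen-suppressed graph given as a list of bonds.

    M1 = sum of squared vertex degrees over all vertices
    M2 = sum of degree products over all bonded (connected) vertex pairs
    """
    degree = {}
    for u, v in edges:
        degree[u] = degree.get(u, 0) + 1
        degree[v] = degree.get(v, 0) + 1
    m1 = sum(d * d for d in degree.values())
    m2 = sum(degree[u] * degree[v] for u, v in edges)
    return m1, m2

# Ring carbons are 0-5; vertex 6 is the methyl carbon of toluene.
benzene_edges = [(i, (i + 1) % 6) for i in range(6)]
toluene_edges = benzene_edges + [(0, 6)]

print(zagreb_indices(benzene_edges), zagreb_indices(toluene_edges))
```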


Table III. Classification of parameters used in developing models for acute aquatic toxicity
(LC50) in Pimephales promelas.

Topological        Topochemical         Geometric    Quantum Chemical (AM1)

I_D^W              I_ORB                Vw           E_HOMO
Ī_D^W              IC0 - IC5            3DW          E_HOMO1
W                  SIC0 - SIC5          3DW_H        E_LUMO
I^D                CIC0 - CIC5                       E_LUMO1
H^V                0χb - 6χb                         ΔHf
H^D                3χbC and 5χbC                     μ
IC                 6χbCh
O                  4χbPC - 6χbPC
M1                 0χv - 6χv
M2                 3χvC and 5χvC
0χ - 6χ            4χvPC - 6χvPC
3χC and 5χC        JB
6χCh               JX
4χPC - 6χPC        JY
P1 - P9
J

Table IV. Summary of the regression results for all models for the full set of sixty-nine
benzene derivatives.


Table V. Calculated values for the topostructural, topochemical, geometric, and quantum
chemical parameters used in equation 4 (Table IV).

No.  M1   P.   SIC1    3DW_H  E_LUMO1   ΔHf        μ

 1   3    0    0.246   5.21    0.5540    22.0240   0.005
 2   3    0    0.315   5.25    0.2447    26.7581   1.449
 3   3    0    0.315   5.25    0.2632    14.8214   1.299
 4   3    0    0.304   5.43    0.5095   -22.2334   1.233
 5   3    0    0.227   5.79    0.5745    16.5004   0.279
 6   4    0    0.341   5.28   -0.0203     9.2203   1.974
 7   4    0    0.341   5.28   -0.0462     8.2544   1.218
 8   4    0    0.341   5.28   -0.0988    10.4661   0.000
 9   4    0    0.362   5.46    0.2406   -28.6621   0.934
10   4    0    0.284   5.81    0.2785     7.1915   1.478
11   4    0    0.284   5.82    0.3208     7.1066   1.623
12   4    0    0.323   5.64    0.3778   -66.4516   2.433
13   4    0    0.295   6.16    0.4618   -59.9961   2.338
14   4    0    0.276   5.95    0.5331   -28.9297   0.960
15   4    0    0.276   5.97    0.5610   -29.6368   1.079
16   4    0    0.276   5.97    0.4880   -29.7869   1.333
17   4    0    0.376   5.84   -0.4095   -19.5199   5.261
18   4    0    0.274   6.59    0.5766   -52.9350   2.424
19   4    0    0.213   6.22    0.6180     7.5221   0.465
20   4    0    0.213   6.28    0.6450     6.8236   0.003
21   4    0    0.341   6.11   -0.2692    19.0823   5.015
22   4    0    0.341   6.14   -0.2921    17.6145   5.443
23   4    0    0.341   6.15   -0.2334    17.2948   5.728
24   4    2    0.389   5.99   -1.2793    38.6210   7.804
25   4    0    0.389   6.01   -1.5339    33.1466   4.845
26   4    0    0.389   6.02   -1.0875    33.2941   0.013
27   4    0    0.344   6.38   -0.1596    20.4489   5.727
28   4    0    0.344   6.41   -0.0919    14.3213   7.434
29   4    0    0.344   6.41   -0.1084    19.7541   6.185
30   4    0    0.344   6.39   -0.0006    13.8471   5.374
31   4    0    0.344   6.42    0.1022    12.9086   5.649
32   4    0    0.344   6.42    0.0314    13.3128   5.280
33   4    0    0.376   6.15   -0.2384   -15.9560   6.801
34   4    0    0.344   6.41   -0.1379    18.0141   5.596
35   4    0    0.349   5.31   -0.3391     4.2313   2.070
36   4    0    0.349   5.31   -0.2761     2.9490   1.033
37   4    0    0.349   5.31   -0.3927     2.2158   0.020
38   4    0    0.385   5.49   -0.1034   -35.1296   0.395
39   4    0    0.312   5.84    0.0251     1.5862   2.296
40   4    0    0.312   5.84    0.0006     1.2199   1.464
41   4    0    0.326   5.99    0.2063   -36.1532   1.059
42   4    0    0.255   6.40    0.5006   -36.4200   1.052
43   4    0    0.255   6.38    0.5503   -35.5810   1.199
44   4    0    0.255   6.38    0.5387   -36.6403   1.229
45   4    0    0.383   6.17   -1.5210    -8.7887   6.201
46   4    0    0.202   6.64    0.6477    -0.1093   0.274
47   4    2    0.365   6.40   -1.2262    31.8226   7.909
48   4    0    0.365   6.43   -1.4332    26.3804   5.390
49   4    0    0.365   6.42   -1.0421    26.9397   0.797
50   4    0    0.365   6.39   -1.4076    30.3487   3.639
51   4    2    0.365   6.43   -1.1564    32.0703   8.256
52   4    0    0.365   6.44   -1.4923    25.3294   5.321
53   4    0    0.378   6.33   -2.5221    44.8961   0.032
54   4    0    0.362   6.66   -1.2453    27.9172   6.590
55   4    0    0.362   6.65   -0.6994    25.1359   3.166
56        0    0.362   6.65   -1.1532    23.8377   5.797
57        0    0.362   6.67   -1.3084    51.2351   7.196
58        0    0.362   6.68   -1.0204    18.0757   2.366
59        0    0.362   6.66   -1.0160    54.7718   3.199
60        0    0.362   6.66   -1.2172    29.5227   5.090
61        0    0.392   5.54   -0.4993    -2.2014   1.096
62        0    0.341   5.34   -0.5585    -0.5979   1.616
63        0    0.341   5.34   -0.6587     3.2072   0.000
64        0    0.392   5.52   -0.3777   -38.2930   1.083
65        0    0.362   6.56   -1.5102   -19.8380   4.669
66        2    0.365   6.66   -1.9189    46.0695   3.518
67        0    0.365   6.67   -2.3240    41.4239   1.418
68        0    0.385   5.54   -0.5526   -43.2613   1.231
69        0    0.362   5.57   -0.7546   -44.7215   1.238

ABSTRACT


The characterization of molecular structure using structural invariants has
increased greatly over the last ten years. Specifically, topological indices have
become more widely used in the quantification of molecular structure for use in
quantitative structure-activity relationship studies, chemical documentation, and
molecular similarity studies. The basis, calculation, and utility of topological
indices have been examined, with an eye to the specific advantages and
problems in their use. In addition, variable clustering and principal component
analysis are examined as two potential solutions to the problem of index
intercorrelation.

INTRODUCTION


An important area of research in computational and mathematical chemistry is
the characterization of molecular structure using structural invariants. The
impetus for this research trend comes from various directions. Researchers in
chemical documentation have searched for a set of invariants which will be more
convenient than the adjacency matrix (or connection table) for the storage and
comparison of chemical structures. Invariants have been used to order sets of
molecules. With the substantial increase in available databases of
chemical  structures  and  properties,  attempts  have  been  made  to  develop 
structure-activity  relationships  (SARs)  whereby  existing  molecules  can  be 
compared  with  other  molecules  (real  or  hypothetical)  on  the  basis  of  these 
structural  invariants.  The  properties  of  the  molecules  of  interest  can  then  be 
predicted  based  on  molecular  structure  without  the  need  for  experimental  data. 

In this age of combinatorial chemistry, thousands of molecules of known
structure can be produced rapidly. However, at the same time resources for
determining even the simplest properties of all of these molecules in the
laboratory are unavailable. In the USA, the Toxic Substances Control Act (TSCA)
Inventory includes nearly 74,000 chemicals and the list is growing at a rate of
more than 2,000 new submissions to the United States Environmental Protection
Agency (USEPA) for the Premanufacture Notification (PMN) process per year.
At present, risk assessment of the PMN chemicals is carried out using limited
test data. For example, approximately 15% of PMN submissions have empirical
mutagenicity data. Under such circumstances, structural descriptors will play a
pivotal role in comparing molecules with one another and in predicting their
properties.

MOLECULAR  STRUCTURE  -  Beauty  in  the  Eye  of  the  Beholder  or  Conundrum? 

The main hurdle to the characterization of molecular structure is the lack of
uniformity in its definition and quantification. The term molecular structure
represents a set of nonequivalent and probably disjoint concepts. For example,
the term "molecule" means different things when it represents an assembly of
identifiable atoms held together by fairly rigid bonds as compared to a collection
of delocalized nuclei and electrons in which all identical particles are
indistinguishable. There is no reason to believe that when we discuss diverse
topics (e.g., chemical synthesis, reaction rates, spectroscopic transitions,
reaction mechanisms, and ab initio calculations) using the notion of molecular
structure, the different meanings we attach to this term originate from the
same fundamental concept. This fundamental problem has been described
succinctly by Woolley:

"...there is no reason to suppose that the same basic idea can
provide a basis for the discussion of all molecular experiments.
This is understandable if one recognizes that every physical and
chemical concept is only defined with respect to a certain class of
experiments, so that it is perfectly reasonable for different sets of
concepts, although mutually incompatible, to be applicable to
different experiments."

In the context of molecular science, the various concepts of molecular
structure (e.g., classical valence bond representation, various chemical
graph-theoretic representations, the ball-and-stick model, representation by minimum
energy conformation, semi-symbolic contour maps, or symbolic representation
by Hamiltonian operators) are distinct molecular models derived through
different means of abstraction from the same chemical reality or molecule. In
each instance, the equivalence class (concept or model of molecular structure) is
generated by selecting certain aspects while ignoring other unique properties of
those actual events. This explains the plurality of the concepts of molecular
structure and their autonomous nature, the word autonomous being used in the
sense that one concept is not logically derived from the other.

GRAPHS  AND  MOLECULAR  STRUCTURE 


At the most fundamental level, the structural model of an assembled entity (e.g.,
a molecule consisting of atoms) may be defined as the pattern of relationship
among its parts as distinct from the values associated with them. Constitutional
formulae of molecules are graphs where vertices represent the set of atoms and
edges represent chemical bonds. The pattern of connectedness of atoms in a
molecule is preserved by constitutional graphs. A graph (more correctly a
non-directed graph) G = [V, E] consists of a finite nonempty set V of points together
with a prescribed set E of unordered pairs of distinct points of V. A structural
model assigns to the points of G a realization in some applied field and each
element of E indicates a pair of entities (elements of the structural model) which
are in the finite nonempty irreflexive symmetric binary relation described by G.
For example, when elements of the set V symbolize atomic cores without
valence electrons and the elements of E represent covalent two-electron bonds,
G is the molecular graph or constitutional graph of a covalent chemical species.
Such a graph can represent structural formulae of a large number of organic
compounds. Since more than 90% of chemical compounds described so far are
either organic or contain organic ligands, such a graph has been found to be
useful in chemistry. The edge set need not always represent a covalent bond.
In fact, elements of E may symbolize almost any type of bond (e.g., ionic,
coordinate, hydrogen, or weak bonds representing transition states of an SN2
reaction, etc.). If the interaction between a pair of atoms is asymmetric (e.g.,
in case of sufficiently polar covalent bonds, hydrogen bond donor acidity,
hydrogen bond acceptor basicity, or charge transfer complex formation) the
bonding pattern can be represented by a binary relation which is anti-reflexive
and asymmetric. Further refinement could be achieved through the assignment
of weights to the vertices or edges, and use of multiple edges between a pair of
atoms held together both by sigma and pi bonds. The weighted pseudograph
appears to be the most general model capable of symbolizing the bonding
pattern of a large number of organic and inorganic chemicals.

For a long time, chemists have relied on visual perception to relate
various aspects of constitutional graphs to observable phenomena. The power of
graph-theoretic formalism in chemistry is evident from its successful applications
in chemical documentation, isomer discrimination and characterization of
molecular branching, enumeration of constitutional isomers associated with a
particular empirical formula, calculation of quantum chemical parameters,
structure-physicochemical property correlations, and chemical structure-
biological activity relationships.

GRAPHS  AS  MOLECULAR  MODELS 


Any concept of molecular structure is a hypothetical sketch of the organization of
atoms within the molecule. Such a model object is a general theory and remains
empirically untestable. A model object has to be grafted to a specific theory to
generate a theoretical model which can be empirically tested. For example,
when it was suggested by Sylvester in 1878 that the structural formula of a
molecule is a special kind of graph, it was an innovative general theory without
any predictive potential. When the idea of combinatorics was applied to
chemical graphs (model object), it could be predicted that "there should be
exactly two isomers of butane (C4H10)" because "there are exactly two tree
graphs with four vertices" when one considers only the non-hydrogen atoms
present in C4H10. This is a theoretical model of limited predictive potential.
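
The butane prediction can be checked by brute force. The following is a minimal sketch, not the method used by the authors: it enumerates every labelled tree on n vertices through Pruefer sequences and counts the distinct shapes. The function names and the `max_degree=4` filter (an assumed stand-in for the valence of carbon) are illustrative choices and do not come from the text.

```python
from itertools import product


def prufer_to_edges(seq, n):
    """Decode a Pruefer sequence into the edge list of a labelled tree on n vertices."""
    degree = [1] * n
    for v in seq:
        degree[v] += 1
    edges = []
    for v in seq:
        # attach the smallest remaining leaf to the next sequence entry
        leaf = min(u for u in range(n) if degree[u] == 1)
        edges.append((leaf, v))
        degree[leaf] -= 1
        degree[v] -= 1
    u, w = [x for x in range(n) if degree[x] == 1]
    edges.append((u, w))
    return edges


def canonical_form(edges, n):
    """Label-independent string for a tree: smallest rooted encoding over all roots."""
    adj = {i: [] for i in range(n)}
    for a, b in edges:
        adj[a].append(b)
        adj[b].append(a)

    def encode(v, parent):
        children = sorted(encode(c, v) for c in adj[v] if c != parent)
        return "(" + "".join(children) + ")"

    return min(encode(root, None) for root in range(n))


def count_carbon_skeletons(n, max_degree=4):
    """Count non-isomorphic trees on n >= 2 vertices with no vertex degree above max_degree."""
    seen = set()
    for seq in product(range(n), repeat=n - 2):
        edges = prufer_to_edges(seq, n)
        degree = [0] * n
        for a, b in edges:
            degree[a] += 1
            degree[b] += 1
        if max(degree) <= max_degree:
            seen.add(canonical_form(edges, n))
    return len(seen)


butane_skeletons = count_carbon_skeletons(4)   # trees on four vertices
hexane_skeletons = count_carbon_skeletons(6)   # degree-limited trees on six vertices
```

For four vertices the count is two, in line with the two isomers of butane. The exhaustive enumeration grows as n^(n-2), so this sketch is only practical for small n.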

Although it predicts the existence of chemical species, given a set of molecules
(e.g., isomers of hexane [C6H14]) the model is incapable of predicting any
properties for these molecules. This is due to the fact that any empirical property
P maps a set of chemical structures into the set R of real numbers and thereby
orders the set empirically. Therefore, to predict the property from structure, we
need a nonempirical (structural) ordering scheme which closely resembles the
empirical ordering of structures as determined by P. This is a more specific
theoretical model based on the same model object (chemical graph) and can be
accomplished by using specific graph invariant(s).

CHARACTERIZATION  OF  MOLECULAR  GRAPHS 

Molecular graphs can be characterized by graph invariants. A graph invariant is
a graph-theoretic property which is preserved by isomorphism. A graph
invariant could be a polynomial, a sequence of numbers, or a single number.
The characteristic polynomial of a graph and the spectra of graphs are graph
invariants. Numerical graph invariants derived from molecular graphs are called
graph-theoretic indices or topological indices. Topological indices quantitatively
describe molecular topology and are sensitive to such structural attributes as
size, shape, patterns of branching, bonding types, and cyclicity of molecules.

Topological indices (TIs) can sometimes be derived conveniently from
different matrices such as the adjacency matrix and the distance matrix. The
origins of such TIs illuminate the fundamental structural features that they
quantify. On the other hand, some indices are derived to quantify a key structural
feature which is qualitative and only understood intuitively. In deriving his original
connectivity index (¹χ), Randić asked the question: which of the two heptane
isomers, viz., 3-methylhexane and 3-ethylpentane, is more branched? Until that
time, branching was understood only intuitively; Randić derived a quantitative
description of branching based on the graph-theoretic treatment of the
structures. In addition, information theoretic indices of chemical structures have
been derived to answer the question: which of a collection of structures is more
complex or heterogeneous? Different measures of molecular complexity attempt
to answer this question from different points of view. In the following section we
discuss the structural basis and method of calculation for some of the major
topological indices.
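
As an illustration of how a connectivity index turns the branching question into a number, the sketch below evaluates the first-order connectivity index for the two heptane isomers named above. It assumes the commonly cited definition, the sum over edges of (d_i d_j)^(-1/2) with d the vertex degree in the hydrogen-suppressed graph; that formula is not spelled out in this text, and the vertex numbering and function name are arbitrary.

```python
from math import sqrt


def randic_index(edges):
    """First-order connectivity index of a hydrogen-suppressed graph given as an edge list."""
    degree = {}
    for a, b in edges:
        degree[a] = degree.get(a, 0) + 1
        degree[b] = degree.get(b, 0) + 1
    return sum(1.0 / sqrt(degree[a] * degree[b]) for a, b in edges)


# 3-methylhexane: chain 1-2-3-4-5-6 with a methyl carbon (7) on carbon 3
three_methylhexane = [(1, 2), (2, 3), (3, 4), (4, 5), (5, 6), (3, 7)]
# 3-ethylpentane: chain 1-2-3-4-5 with an ethyl group (6-7) on carbon 3
three_ethylpentane = [(1, 2), (2, 3), (3, 4), (4, 5), (3, 6), (6, 7)]

chi_methylhexane = randic_index(three_methylhexane)   # about 3.308
chi_ethylpentane = randic_index(three_ethylpentane)   # about 3.346
```

Under this assumed definition the two isomers receive different values, so the index orders them even though both have a single branch point.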

CALCULATION  OF  TOPOLOGICAL  INDICES 
The Wiener index W, the first topological index reported in the chemical
literature, may be calculated from the distance matrix D(G) of a hydrogen-
suppressed chemical graph G as the sum of the entries in the upper triangular
distance submatrix. The distance matrix D(G) of a nondirected graph G with n
vertices is a symmetric n x n matrix ($d_{ij}$), where $d_{ij}$ is equal to the distance
between vertices $v_i$ and $v_j$ in G. Each diagonal element $d_{ii}$ of D(G) is zero. We
give below the distance matrix D(G1) of the labelled hydrogen-suppressed graph
G1 of 2,3-dimethylhexane (Fig. 1):

        (1)  (2)  (3)  (4)  (5)  (6)  (7)  (8)

(1)      0    1    2    2    3    3    4    5
(2)      1    0    1    1    2    2    3    4
(3)      2    1    0    2    3    3    4    5
(4)      2    1    2    0    1    1    2    3
(5)      3    2    3    1    0    2    3    4
(6)      3    2    3    1    2    0    1    2
(7)      4    3    4    2    3    1    0    1
(8)      5    4    5    3    4    2    1    0

W is calculated as:

$$W = \frac{1}{2} \sum_{i,j} d_{ij} = \sum_{h} h \cdot g_h \tag{1}$$

where $g_h$ is the number of unordered pairs of vertices whose distance is h. Thus
for D(G1), W has a value of seventy.
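
The calculation of W can be reproduced with a short breadth-first search. This is a sketch only; the edge list is read off the distance matrix given for 2,3-dimethylhexane (entries equal to 1), and the function names are illustrative.

```python
from collections import deque

# Bonds of the hydrogen-suppressed graph, i.e. the vertex pairs at distance 1
EDGES = [(1, 2), (2, 3), (2, 4), (4, 5), (4, 6), (6, 7), (7, 8)]


def all_distances(edges):
    """Return {source: {target: topological distance}} using breadth-first search."""
    adj = {}
    for a, b in edges:
        adj.setdefault(a, []).append(b)
        adj.setdefault(b, []).append(a)
    result = {}
    for source in adj:
        dist = {source: 0}
        queue = deque([source])
        while queue:
            u = queue.popleft()
            for w in adj[u]:
                if w not in dist:
                    dist[w] = dist[u] + 1
                    queue.append(w)
        result[source] = dist
    return result


def distance_counts(edges):
    """g_h: number of unordered vertex pairs at distance h."""
    dist = all_distances(edges)
    counts = {}
    for u in dist:
        for v, d in dist[u].items():
            if u < v:
                counts[d] = counts.get(d, 0) + 1
    return counts


def wiener_index(edges):
    """Sum of h * g_h, equal to the sum of the upper triangle of the distance matrix."""
    return sum(h * g for h, g in distance_counts(edges).items())


g_h = distance_counts(EDGES)
W = wiener_index(EDGES)
```

For this graph the pair counts are 7, 8, 7, 4 and 2 for distances 1 to 5, which sum to the 28 vertex pairs of an eight-vertex graph.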
[Insert Fig. 1 here]


Randić's connectivity index and higher-order connectivity path, cluster, path-
cluster and chain types of simple, bond and valence connectivity parameters
were calculated using the method of Kier and Hall. $P_h$ parameters, the numbers of
paths of length h (h = 0, 1, ..., 10) in the hydrogen-suppressed graph, are
calculated using standard algorithms.

Balaban defined a series of indices based upon distance sums within the
distance matrix for a chemical graph which he designated as J indices.
These indices are highly discriminating with low degeneracy. Unlike W, the
range of values of the J indices is independent of molecular size.
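
A sketch of the simplest member of this family follows. It assumes the widely used form J = q / (mu + 1) * sum over edges of (s_i s_j)^(-1/2), where s_i is the distance sum of vertex i, q the number of edges and mu the number of rings; that formula is not given in this text, and the variants weighted by bond type, electronegativity or covalent radius are not covered. The edge list is the hydrogen-suppressed graph of 2,3-dimethylhexane used above.

```python
from collections import deque
from math import sqrt

EDGES = [(1, 2), (2, 3), (2, 4), (4, 5), (4, 6), (6, 7), (7, 8)]


def distance_sums(edges):
    """Row sums of the distance matrix, one per vertex, computed by breadth-first search."""
    adj = {}
    for a, b in edges:
        adj.setdefault(a, []).append(b)
        adj.setdefault(b, []).append(a)
    sums = {}
    for source in adj:
        dist = {source: 0}
        queue = deque([source])
        while queue:
            u = queue.popleft()
            for w in adj[u]:
                if w not in dist:
                    dist[w] = dist[u] + 1
                    queue.append(w)
        sums[source] = sum(dist.values())
    return sums


def balaban_j(edges):
    """Distance-sum connectivity index J of a connected graph (assumed standard form)."""
    sums = distance_sums(edges)
    q = len(edges)
    mu = q - len(sums) + 1          # cyclomatic number; zero for an acyclic alkane
    return q / (mu + 1) * sum(1.0 / sqrt(sums[a] * sums[b]) for a, b in edges)


J = balaban_j(EDGES)   # about 3.171 for 2,3-dimethylhexane
```

The normalization by q / (mu + 1) is what keeps the values in a narrow range as molecules grow, in contrast to W.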

Information-theoretic topological indices are calculated by the application
of information theory on chemical graphs. An appropriate set A of n elements is
derived from a molecular graph G depending upon certain structural
characteristics. On the basis of an equivalence relation defined on A, the set A
is partitioned into disjoint subsets $A_i$ of order $n_i$ ($i = 1, 2, \ldots, h$; $\sum_i n_i = n$). A
probability distribution is then assigned to the set of equivalence classes:

$$A_1, A_2, \ldots, A_h$$

$$p_1, p_2, \ldots, p_h$$

where $p_i = n_i / n$ is the probability that a randomly selected element of A will occur
in the i-th subset.

The mean information content of an element of A is defined by Shannon's
relation:

$$IC = -\sum_{i=1}^{h} p_i \log_2 p_i \tag{2}$$

The logarithm is taken at base 2 for measuring the information content in bits.

The total information content of the set A is then $n \times IC$.

It is to be noted that the information content of a graph G is not uniquely
defined. It depends on how the set A is derived from G as well as on the
equivalence relation which partitions A into disjoint subsets $A_i$. For example,
when A constitutes the vertex set of a chemical graph G, two methods of
partitioning have been widely used: a) chromatic-number coloring of G where
two vertices of the same color are considered equivalent, and b) determination
of the orbits of the automorphism group of G, whereafter vertices belonging to the
same orbit are considered equivalent.

Rashevsky was the first to calculate the information content of graphs
where "topologically equivalent" vertices were placed in the same equivalence
class. In Rashevsky's approach, two vertices u and v of a graph are said to be
topologically equivalent if and only if for each neighboring vertex $u_i$ ($i = 1, 2, \ldots, k$)
of the vertex u, there is a distinct neighboring vertex $v_i$ of the same degree for
the vertex v. While Rashevsky used simple linear graphs with indistinguishable
vertices to symbolize molecular structure, weighted linear graphs or multigraphs
are better models for conjugated or aromatic molecules because they more
properly reflect the actual bonding patterns, i.e., electron distribution.

To account for the chemical nature of vertices as well as their bonding
pattern, Sarkar et al. calculated information content of chemical graphs on the
basis of an equivalence relation where two atoms of the same element are
considered equivalent if they possess an identical first-order topological
neighborhood. Since properties of atoms or reaction centers are often
modulated by stereo-electronic characteristics of distant neighbors, i.e.,
neighbors of neighbors, it was deemed essential to extend this approach to
account for higher-order neighbors of vertices. This can be accomplished by
defining open spheres for all vertices of a chemical graph. If r is any
non-negative real number and v is a vertex of the graph G, then the open sphere
S(v, r) is defined as the set consisting of all vertices $v_i$ in G such that $d(v, v_i) < r$.
Therefore, S(v, 0) = ∅, S(v, r) = {v} for 0 < r < 1, and S(v, r) is the set consisting of
v and all vertices $v_i$ of G situated at unit distance from v, if 1 < r < 2.
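
The open-sphere definition translates directly into code. The sketch below applies it to the hydrogen-suppressed graph of 2,3-dimethylhexane used earlier for the distance matrix; the function name `open_sphere` is illustrative.

```python
from collections import deque

EDGES = [(1, 2), (2, 3), (2, 4), (4, 5), (4, 6), (6, 7), (7, 8)]


def open_sphere(edges, v, r):
    """S(v, r): all vertices whose topological distance from v is strictly less than r."""
    adj = {}
    for a, b in edges:
        adj.setdefault(a, []).append(b)
        adj.setdefault(b, []).append(a)
    dist = {v: 0}
    queue = deque([v])
    while queue:
        u = queue.popleft()
        for w in adj[u]:
            if w not in dist:
                dist[w] = dist[u] + 1
                queue.append(w)
    return {u for u, d in dist.items() if d < r}


empty_sphere = open_sphere(EDGES, 2, 0)      # no vertex has distance below zero
self_sphere = open_sphere(EDGES, 2, 0.5)     # only the vertex itself
first_sphere = open_sphere(EDGES, 2, 1.5)    # the vertex and its bonded neighbors
```

Because graph distances are integers, the sphere only changes when r crosses an integer value, which is why the text can restrict attention to integral orders of neighborhood.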

One can construct such open spheres for higher integral values of r. For a
particular value of r, the collection of all such open spheres S(v, r), where v runs
over the whole vertex set V, forms a neighborhood system of the vertices of G. A
suitably defined equivalence relation can then partition V into disjoint subsets
consisting of vertices which are topologically equivalent for r-th order
neighborhood. Such an approach has been developed and the information-theoretic
indices calculated based on this idea are called indices of
neighborhood symmetry.

In this method, chemicals are symbolized by weighted linear graphs. Two
vertices $u_0$ and $v_0$ of a molecular graph are said to be equivalent with respect to
r-th order neighborhood if and only if corresponding to each path $u_0, u_1, \ldots, u_r$ of
length r, there is a distinct path $v_0, v_1, \ldots, v_r$ of the same length such that the
paths have similar edge weights, and both $u_0$ and $v_0$ are connected to the same
number and type of atoms up to the r-th order bonded neighbors. The detailed
equivalence relation has been described in earlier studies.

Once partitioning of the vertex set for a particular order of neighborhood is
completed, $IC_r$ is calculated by Eq. 2. Subsequently, Basak et al. defined another
information-theoretic measure, structural information content ($SIC_r$), which is
calculated as:

$$SIC_r = IC_r / \log_2 n \tag{3}$$

where $IC_r$ is calculated from Eq. 2 and n is the total number of vertices of the
graph.

Another information-theoretic invariant, complementary information
content ($CIC_r$), is defined as:

$$CIC_r = \log_2 n - IC_r \tag{4}$$

$CIC_r$ represents the difference between maximum possible complexity of a graph
(where each vertex belongs to a separate equivalence class) and the realized
topological information of a chemical species as defined by $IC_r$.

In Fig. 2, the calculation of $IC_1$, $SIC_1$ and $CIC_1$ is demonstrated for the
hydrogen-filled graph (G1) of 2,3-dimethylhexane.
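
The first-order calculation can be sketched as follows. This is a simplified stand-in for the full equivalence relation: two atoms are placed in the same class when they are the same element and their bonded neighbors are the same multiset of elements, which is sufficient for a saturated hydrocarbon where every bond is single. The helper names and the way hydrogens are attached are assumptions made for the sketch.

```python
from collections import Counter
from math import log2

# Carbon skeleton of 2,3-dimethylhexane (same labelling as the distance matrix)
CARBON_EDGES = [(1, 2), (2, 3), (2, 4), (4, 5), (4, 6), (6, 7), (7, 8)]


def hydrogen_filled(carbon_edges, valence=4):
    """Attach hydrogens until every carbon has `valence` neighbors."""
    atoms, adj = {}, {}
    for a, b in carbon_edges:
        for c in (a, b):
            atoms.setdefault(f"C{c}", "C")
            adj.setdefault(f"C{c}", [])
        adj[f"C{a}"].append(f"C{b}")
        adj[f"C{b}"].append(f"C{a}")
    h = 0
    for carbon in list(atoms):
        for _ in range(valence - len(adj[carbon])):
            h += 1
            atoms[f"H{h}"] = "H"
            adj[f"H{h}"] = [carbon]
            adj[carbon].append(f"H{h}")
    return atoms, adj


def first_order_classes(atoms, adj):
    """Class sizes keyed by (element, sorted elements of the bonded neighbors)."""
    return Counter(
        (atoms[a], tuple(sorted(atoms[nb] for nb in adj[a]))) for a in atoms
    )


def information_indices(class_sizes):
    """Return (IC, SIC, CIC) from Eqs. 2-4 for a partition given by its class sizes."""
    n = sum(class_sizes)
    ic = -sum((k / n) * log2(k / n) for k in class_sizes)
    return ic, ic / log2(n), log2(n) - ic


atoms, adj = hydrogen_filled(CARBON_EDGES)
sizes = sorted(first_order_classes(atoms, adj).values())
ic1, sic1, cic1 = information_indices(sizes)
```

With these assumptions the 26 atoms fall into classes of 18, 4, 2 and 2 atoms, and direct evaluation of Eqs. 2-4 with base-2 logarithms gives roughly IC1 = 1.352 bits, SIC1 = 0.288 and CIC1 = 3.348 bits.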

[Insert Fig. 2 here]

The information-theoretic index on graph distance, $I_D^W$, is calculated from the
distance matrix D(G) of a chemical graph G as follows:

$$I_D^W = W \log_2 W - \sum_{h} g_h \cdot h \log_2 h \tag{5}$$

The mean information index, $\bar{I}_D^W$, is found by dividing the information
index $I_D^W$ by W. The information theoretic parameters defined on the distance
matrix, $H^D$ and $H^V$, were calculated by the method of Raychaudhury et al.
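
Equation 5 and the mean index can be evaluated from the distance counts alone. In the sketch below the counts $g_h$ are those of the 2,3-dimethylhexane distance matrix given earlier (7, 8, 7, 4 and 2 pairs at distances 1 to 5); the function name is illustrative.

```python
from math import log2

# g_h for the hydrogen-suppressed graph of 2,3-dimethylhexane
G_H = {1: 7, 2: 8, 3: 7, 4: 4, 5: 2}


def information_index_on_distance(counts):
    """Return (W, I_D^W, mean I_D^W) from the pair counts g_h, following Eq. 5."""
    w = sum(h * g for h, g in counts.items())
    total = w * log2(w) - sum(g * h * log2(h) for h, g in counts.items())
    return w, total, total / w


W, I_DW, mean_I_DW = information_index_on_distance(G_H)
```

For this graph W is 70, the information index is about 324.5 bits and the mean index about 4.64 bits.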


THEORETICAL  METHODS 


Databases  and  Calculations 


Two data sets were used for this study: the first consists of the seventy-four
alkanes (C2-C9) and the second, more heterogeneous set was taken from the
STARLIST group of chemicals. The STARLIST subset includes 219 chemicals
for which HB1 was equal to zero and calculated log P values fell in the range of
-2 to 5.5. HB1 is a measure of the hydrogen bonding potential of a chemical.
Chemical structures for these compounds were encoded using the SMILES line
notation for chemical structures and entered into the computer program POLLY
version 2.3 for the calculation of indices. Table I provides a comprehensive list
and brief descriptions for these indices.

Statistical  Methods 

Initially all TIs were transformed by the natural logarithm of the index plus one.
This is routinely done to scale the indices since there may be a difference of
several orders of magnitude between indices and some may equal zero.

From the original sets of 102 indices calculated for both data sets, it was
necessary to remove some indices. Some of the indices for the set of alkanes
(e.g., the simple, valence, and bond connectivity indices) were completely
redundant. Other indices were removed because they had values of zero for all
compounds. This "cleaning" of the sets of TIs left fifty-three indices for the
alkanes and ninety-eight indices for the STARLIST set.
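
The transformation and cleaning steps can be sketched in a few lines. The index names and values below are hypothetical placeholders, not data from this study, and "completely redundant" is interpreted here as a column that duplicates one already kept; the original selection may have used a different criterion.

```python
from math import log


def log_transform(table):
    """Apply ln(x + 1) to every index value; zero stays zero."""
    return {name: [log(x + 1.0) for x in values] for name, values in table.items()}


def clean(table, tol=1e-9):
    """Drop all-zero columns and columns identical to one that is already kept."""
    kept = {}
    for name, values in table.items():
        if all(abs(x) <= tol for x in values):
            continue                      # zero for all compounds
        if any(all(abs(x - y) <= tol for x, y in zip(values, other))
               for other in kept.values()):
            continue                      # duplicates an earlier column
        kept[name] = values
    return kept


# Hypothetical raw index values for three compounds
raw = {
    "chi_1": [1.414, 1.732, 2.270],
    "chi_1_valence": [1.414, 1.732, 2.270],   # identical for saturated hydrocarbons
    "chi_3_chain": [0.0, 0.0, 0.0],           # no rings, so zero throughout
    "W": [4.0, 9.0, 18.0],
}
cleaned = clean(log_transform(raw))
```

In this toy table only `chi_1` and `W` survive, mirroring how the valence and bond connectivity indices collapse onto the simple ones for alkanes.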

Variable clustering and principal component analysis were used on the
remaining indices to minimize problems of intercorrelation amongst the indices.
The variable clustering was conducted using the SAS procedure VARCLUS
which divides the indices into disjoint clusters which are essentially
unidimensional based on the correlation matrix. From each cluster, the index
which was most correlated with the cluster was selected as the best
representative of that cluster. In this way, individual indices are retained while
minimizing intercorrelations. This procedure resulted in the retention of eight TIs
for the alkanes: SIC0, SIC1, SIC4, ³χC, ⁵χC, P4, P8, [illegible]; and twelve TIs for the
STARLIST data: 1%, IC4, SIC3, CIC1, “xo,. Vch. Vc Yc V*pc. Ps. J®. TI
values for a subset of the alkanes, the eighteen octane isomers, are presented
in Table II.

The principal component analysis (PCA) was accomplished using the
SAS procedure PRINCOMP. The PCA produces linear combinations of the TIs,
called principal components (PCs), which are derived from the correlation
matrix. The first PC has the largest variance, or eigenvalue, of the linear
combination of TIs. Each subsequent PC explains the maximal index variance
orthogonal to previous PCs, eliminating the redundancy which can occur with
TIs. The maximum number of PCs generated is equal to the number of individual
TIs available. For the purposes of this study, only PCs with eigenvalues greater
than one were retained. A more detailed explanation of this approach has been
provided in a previous study by Basak et al. Seven PCs with eigenvalues
greater than one were retained for the alkanes and ten for the STARLIST set.
Table III presents the
PCs for the octane isomers, a subset of the seventy-four alkanes.
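
The eigenvalue-greater-than-one rule can be illustrated without SAS. The sketch below is not the PRINCOMP procedure: it builds a correlation matrix in plain Python, obtains its eigenvalues with a textbook Jacobi rotation scheme, and keeps those above one. The three data columns are hypothetical values chosen only to show two strongly correlated variables and a third related one.

```python
from math import atan2, cos, sin, sqrt


def correlation_matrix(columns):
    """Pearson correlation matrix of equally long, non-constant data columns."""
    n = len(columns[0])
    z = []
    for col in columns:
        mean = sum(col) / n
        sd = sqrt(sum((x - mean) ** 2 for x in col) / (n - 1))
        z.append([(x - mean) / sd for x in col])
    k = len(columns)
    return [[sum(a * b for a, b in zip(z[i], z[j])) / (n - 1) for j in range(k)]
            for i in range(k)]


def jacobi_eigenvalues(matrix, sweeps=100, tol=1e-20):
    """Eigenvalues of a symmetric matrix by cyclic Jacobi rotations, largest first."""
    a = [row[:] for row in matrix]
    k = len(a)
    for _ in range(sweeps):
        off = sum(a[i][j] ** 2 for i in range(k) for j in range(k) if i != j)
        if off < tol:
            break
        for p in range(k - 1):
            for q in range(p + 1, k):
                if abs(a[p][q]) < 1e-15:
                    continue
                theta = 0.5 * atan2(2.0 * a[p][q], a[q][q] - a[p][p])
                c, s = cos(theta), sin(theta)
                for i in range(k):            # rotate columns p and q
                    aip, aiq = a[i][p], a[i][q]
                    a[i][p] = c * aip - s * aiq
                    a[i][q] = s * aip + c * aiq
                for j in range(k):            # rotate rows p and q
                    apj, aqj = a[p][j], a[q][j]
                    a[p][j] = c * apj - s * aqj
                    a[q][j] = s * apj + c * aqj
    return sorted((a[i][i] for i in range(k)), reverse=True)


# Hypothetical transformed index values for six compounds
columns = [
    [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
    [2.1, 3.9, 6.2, 8.1, 9.8, 12.2],
    [1.0, 3.0, 2.0, 5.0, 4.0, 6.0],
]
eigenvalues = jacobi_eigenvalues(correlation_matrix(columns))
retained = [ev for ev in eigenvalues if ev > 1.0]
```

Because the eigenvalues of a correlation matrix sum to the number of variables, a component with an eigenvalue above one carries more variance than any single standardized index, which is the rationale usually given for this cutoff.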

DISCRIMINATION  OF  ISOMERS  USING  TOPOLOGICAL  INDICES  AND 
PRINCIPAL  COMPONENTS  DERIVED  FROM  THEM 

Topological  aspects  of  chemicals  have  been  used  in  chemical  documentation. 
One  line  of  research  in  this  area  has  been  the  development  of  topological  indices 
which  are  more  discriminatory.  For  example,  the  J  index  developed  by  Balaban 
is one of the most discriminatory indices. Randić developed the concept of
molecular identification number (I.D. number) by combining a few topological
aspects of structures. Other authors have used more than one index for this
purpose. One example is the topological superindex proposed by Bonchev et al.,
where they use a collection of indices as the superindex. Two structures are
said  to  be  distinct  if  the  magnitudes  of  any  one  of  the  component  indices  differ 
for  them. 
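
The superindex criterion is a component-wise comparison. The sketch below uses three rows transcribed from Table II (reading obvious scanning slips such as "1-233" as 1.233); the function names are illustrative.

```python
# Eight selected index values per isomer, in the column order of Table II
ROWS = {
    "2-methylheptane": (1.233, 0.173, 0.248, 0.561, 0.342, 0.000, 2, 0),
    "3-methylheptane": (1.228, 0.173, 0.248, 0.598, 0.254, 0.000, 2, 0),
    "4-methylheptane": (1.215, 0.173, 0.248, 0.503, 0.254, 0.000, 2, 0),
}


def distinct(a, b):
    """Two structures are distinct if any component index differs."""
    return any(x != y for x, y in zip(a, b))


def all_distinct(rows):
    """True when every pair of structures is separated by the superindex."""
    names = list(rows)
    return all(distinct(rows[m], rows[n])
               for i, m in enumerate(names) for n in names[i + 1:])


def degenerate_columns(rows):
    """Column positions where at least two structures share the same value."""
    columns = list(zip(*rows.values()))
    return [i for i, col in enumerate(columns) if len(set(col)) < len(col)]


separated = all_distinct(ROWS)
shared = degenerate_columns(ROWS)
```

Six of the eight columns are degenerate on their own for these three isomers, yet the tuple as a whole separates all of them.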

In view of the intercorrelation of indices and the fact that a large number
of TIs have been defined in the literature, we have been interested in deriving
orthogonal parameters from TIs. We have employed two statistical methods:
variable clustering and principal components analysis (PCA). In the former
method, we begin with the TIs calculated by POLLY and derive a small set of
original variables which are minimally intercorrelated. In the case of seventy-four
alkanes the method retained eight indices. In case of PCA, seven principal
components (PCs) are derived from original variables and these PCs are linear
combinations of all the TIs. For the STARLIST set, twelve TIs were retained by
variable clustering, while ten PCs were derived.

We were interested to see the discriminatory power of the TIs selected by
variable clustering vis-à-vis the PCs. Values of the TIs selected by the variable
clustering technique and the first seven PCs with eigenvalue greater than 1.0 for
the set of eighteen octane isomers are presented in Tables II and III
respectively. It is clear from the data that some individual TIs are not sufficiently
discriminatory for the eighteen octane isomers. On the other hand, each PC is
unique for any given structure, making them more discriminatory than any
individual TI. In the interest of space, the values of the TIs and PCs for all of the
alkanes and for the STARLIST set were not included in the tables; however, this
information is available upon request from the authors.

TOPOLOGICAL  INDEX  SPACE  VIS-A-VIS  PC  SPACE:  What  Do  They  Mean? 

Each TI quantifies certain aspects of molecular structure. Distinct indices
selected by the variable clustering procedure encode different information
regarding molecular structure (model object). For example, indices like the
connectivity index or Wiener index quantify adjacency information of the simple
planar graph model of molecules. On the other hand, information theoretic graph
invariants quantify the degree of complexity of the molecular graph. Intuitively,
these are distinct aspects of molecular structure and this notion is borne out by
the result of variable clustering analysis on the set of 102 TIs calculated by
POLLY. It is tempting to speculate that each index retained by variable clustering
represents one distinct aspect of molecular architecture and that, collectively, the
TIs form the structure space of the set of chemicals. Such a space can be used
for the discrimination of structures and structure-property correlation. The
magnitudes of eight TIs for the eighteen octane isomers show that the TIs
selected by variable clustering have reasonable power for discriminating
isomeric structures.

At the level of PCs, we have derived a certain number of orthogonal
variables using PCA of the indices. For the alkanes we had seven PCs with
eigenvalues greater than 1.0 (Table III) whereas for the structurally diverse set of
219 compounds we had ten PCs with eigenvalues greater than 1.0. This result
indicates that the structure space for the set of 219 molecules is more complex
than that for the set of seventy-four alkanes. This is in agreement with our
intuitive notion that molecules with heteroatoms and many functional groups are
more complex than molecules devoid of any heteroatom. Finally, the pattern of
correlation of the individual PCs with the TIs can help us in understanding the
nature of the axes derived by PCA (Tables IV and V).

DISCUSSION 

The major objectives of this paper were: a) to illuminate the fundamental nature
of mathematical invariants of molecular structure, b) to study the utility of graph
invariants in the characterization of molecular structure, and c) to study the
intercorrelation of indices and extraction of orthogonal variables from TIs.

It is clear from the results presented in this paper that the various classes
of mathematical invariants quantify different aspects of molecular architecture.
They depend principally on the structural model (model object) used for the
calculation of the invariant as well as the intuitive aspect of molecular structure
they are used to quantify. For example, connectivity indices and neighbor
complexity indices were designed to quantify distinct aspects of molecular
structure. The results of variable clustering of the congeneric set of alkanes and
the diverse set of 219 chemicals show that these indices encode largely
independent structural information about these molecules.

Many structural schemes have been developed for the derivation of
numbers or sets of numbers which can discriminate closely related structures so
that they can be useful in chemical documentation. The results presented in this
paper show that both the collection of indices selected by variable clustering as
well as the PCs can discriminate among the eighteen octane isomers (Tables II-
V). It is also clear from the data that the PCs are more discriminatory than the
individual indices. For example, each PC has distinct values for all eighteen
octane isomers. PCs derived from TIs have also been used in the discrimination
of isospectral molecular graphs where individual indices show a high degree of
degeneracy.

Variable clustering of TIs for the set of seventy-four alkanes retained eight
parameters which can be classified into three subsets: a) [illegible], P4, and P8 which
represent generalized size and shape; b) SIC0, SIC1, and SIC4 which quantify
molecular complexity; and c) ³χC and ⁵χC which encode information about
molecular branching. In the case of the more diverse set of 219 chemicals, the
indices retained after variable clustering fall into four subclasses: a) I yj, P^, and
[illegible] (general shape and size); b) IC4, SIC3, and CIC1 (complexity); c) ^Xch and ®x''ch
(cyclicity); and d) Vc. Yc Vpc. and J® (branching). A perusal of results from
both the sets indicates that distinct indices quantify different intuitive aspects of
molecular structure.

A similar picture emerges from the principal component analysis of both
sets of molecules. The first PC is strongly correlated with variables which
quantify shape and size. The next important factor is molecular complexity which
is encoded by the second PC (Tables IV and V). The higher order PCs (3-5) are
strongly correlated with invariants which quantify such subtle structural factors as
branching, cyclicity, etc. It may be mentioned that such a result emerged from
our earlier studies on a large, diverse set of 3,692 chemicals.

In conclusion, mathematical invariants derived from chemical topology
quantify different aspects of molecular architecture which are intuitively
understood by the chemist. One can create a structure space from these
invariants taking uncorrelated structural information (indices or PCs). Such
orthogonal factors can be useful in the discrimination of closely related structures
like isomers and in the creation of structure spaces. Metrics defined on such
spaces have been useful in the quantification of molecular similarity.
Orthogonal variables derived by PCA or variable clustering can also be used in
QSAR studies pertaining to pharmacology and toxicology.

FIGURE CAPTIONS


Figure 1. Hydrogen-suppressed graph of 2,3-dimethylhexane.

Figure 2. The calculation of IC1, SIC1 and CIC1 based on the first order
neighborhoods for the labeled graph of 2,3-dimethylhexane.


[Figure 2]

Labeled graph: hydrogen-filled graph of 2,3-dimethylhexane, with carbon chain
C1-C2-C4-C6-C7-C8, methyl carbons C3 and C5 attached to C2 and C4
respectively, and hydrogens H1-H18.

First order neighborhoods, subsets, and probabilities:

Class   Subset         First order neighborhood   Probability (p_i)
I       (H1-H18)       C                          18/26
II      (C1,3,5,8)     H, H, H, C                 4/26
III     (C2,4)         H, C, C, C                 2/26
IV      (C6,7)         H, H, C, C                 2/26

$$IC_1 = -\sum_i p_i \log_2 p_i = 2 \cdot \frac{2}{26} \log_2 \frac{26}{2} + \frac{4}{26} \log_2 \frac{26}{4} + \frac{18}{26} \log_2 \frac{26}{18} = 1.352 \text{ bits}$$

$$SIC_1 = IC_1 / \log_2 26 = 0.288$$

$$CIC_1 = \log_2 26 - IC_1 = 3.348 \text{ bits}$$



Table I. Symbols and definitions of topological indices.

Symbol      Definition

I_D^W       Information index for the magnitudes of distances between all possible
            pairs of vertices of a graph
Ī_D^W       Mean information index for the magnitude of distance
W           Wiener index = half-sum of the off-diagonal elements of the distance
            matrix of a graph
I^D         Degree complexity
H^V         Graph vertex complexity
H^D         Graph distance complexity
IC (bar)    Information content of the distance matrix partitioned by frequency of
            occurrences of distance h
I_ORB       Information content or complexity of the hydrogen-suppressed graph at
            its maximum neighborhood of vertices
O           Order of neighborhood when IC_r reaches its maximum value for the
            hydrogen-filled graph
M_1         A Zagreb group parameter = sum of square of degree over all vertices
M_2         A Zagreb group parameter = sum of cross-product of degrees over all
            neighboring (connected) vertices
IC_r        Mean information content or complexity of a graph based on the r-th
            (r = 0-6) order neighborhood of vertices in a hydrogen-filled graph
SIC_r       Structural information content for r-th (r = 0-6) order neighborhood of
            vertices in a hydrogen-filled graph
CIC_r       Complementary information content for r-th (r = 0-6) order neighborhood
            of vertices in a hydrogen-filled graph
hχ          Path connectivity index of order h = 0-6
hχ_C        Cluster connectivity index of order h = 3-6
hχ_PC       Path-cluster connectivity index of order h = 4-6
hχ_Ch       Chain connectivity index of order h = 3-6
hχ^b        Bond path connectivity index of order h = 0-6
hχ^b_C      Bond cluster connectivity index of order h = 3-6
hχ^b_Ch     Bond chain connectivity index of order h = 3-6
hχ^b_PC     Bond path-cluster connectivity index of order h = 4-6
hχ^v        Valence path connectivity index of order h = 0-6
hχ^v_C      Valence cluster connectivity index of order h = 3-6
hχ^v_Ch     Valence chain connectivity index of order h = 3-6
hχ^v_PC     Valence path-cluster connectivity index of order h = 4-6
P_h         Number of paths of length h = 0-10
J           Balaban's J index based on distance
J^B         Balaban's J index based on bond types
J^X         Balaban's J index based on relative electronegativities
J^Y         Balaban's J index based on relative covalent radii



Table II. TIs selected by variable clustering of the alkanes (octane isomers listed).

Isomer name                 [illegible]  SIC0   SIC1   SIC4   ³χC    ⁵χC    P4  P8

Octane                      1.288        0.173  0.218  0.477  0.000  0.000  2   0
2-methylheptane             1.233        0.173  0.248  0.561  0.342  0.000  2   0
3-methylheptane             1.228        0.173  0.248  0.598  0.254  0.000  2   0
4-methylheptane             1.215        0.173  0.248  0.503  0.254  0.000  2   0
3-ethylhexane               [illegible]  0.173  0.248  0.532  0.186  0.000  2   0
2,2-dimethylhexane          1.157        0.173  0.248  0.495  0.940  0.000  2   0
2,3-dimethylhexane          1.170        0.173  0.253  0.557  0.450  0.212  2   0
2,4-dimethylhexane          1.171        0.173  0.253  0.557  0.529  0.000  2   0
2,5-dimethylhexane          1.183        0.173  0.253  0.384  0.597  0.000  2   0
3,3-dimethylhexane          1.137        0.173  0.248  0.548  0.792  0.000  2   0
3,4-dimethylhexane          1.157        0.173  0.253  0.469  0.386  0.154  2   0
3-ethyl-2-methylpentane     1.096        0.173  0.253  0.490  0.405  0.154  2   0
3-ethyl-3-methylpentane     1.073        0.173  0.248  0.421  0.656  0.000  1   0
2,2,3-trimethylpentane      1.075        0.173  0.255  0.490  0.944  0.477  1   0
2,2,4-trimethylpentane      1.083        0.173  0.255  0.450  1.088  0.000  2   0
2,3,3-trimethylpentane      1.065        0.173  0.255  0.506  0.850  0.529  1   0
2,3,4-trimethylpentane      1.097        0.173  0.225  0.413  0.620  0.326  2   0
2,2,3,3-tetramethylbutane   0.997        0.173  0.218  0.218  1.253  1.179  0   0


Isomer name                 PC1     PC2     PC3     PC4     PC5     PC6     PC7

2-methylheptane             2.181   -4.236  1.097   0.386   1.100   0.300   -0.935
3-methylheptane             2.817   -4.857  -0.307  0.921   0.368   0.366   -0.513
4-methylheptane             1.338   -2.211  0.848   -0.821  0.005   -0.541  -0.904
3-ethylhexane               1.553   -1.077  -0.348  -0.494  -0.817  -0.651  -0.290
2,2-dimethylhexane          1.163   0.007   -0.436  -0.878  1.367   1.383   0.638
2,3-dimethylhexane          2.122   -2.060  -1.546  0.502   -0.308  -0.253  -0.105
2,4-dimethylhexane          2.089   -2.306  -1.372  -0.289  -0.205  0.004   0.291
2,5-dimethylhexane          -0.769  1.340   1.473   -2.659  0.612   -0.387  -1.443
3,3-dimethylhexane          2.044   -0.573  -1.726  0.303   0.173   0.582   1.163
3,4-dimethylhexane          0.807   0.228   -0.825  -0.696  -0.730  -1.223  -0.545
3-ethyl-2-methylpentane     0.991   -0.035  -1.596  -0.672  -1.076  -1.438  0.110
3-ethyl-3-methylpentane     -0.035  2.870   -0.614  -0.909  -0.497  -1.178  0.271
2,2,3-trimethylpentane      1.136   2.191   -2.383  1.277   0.465   -0.075  0.548
2,2,4-trimethylpentane      0.377   2.377   -1.284  -1.846  0.726   0.461   1.676
2,3,3-trimethylpentane      1.318   1.825   -2.717  1.990   0.318   -0.400  0.251
2,3,4-trimethylpentane      -0.548  4.168   1.329   0.020   -1.745  -1.140  -0.039
2,2,3,3-tetramethylbutane   -4.473  12.522  2.681   4.256   1.345   -0.129  -2.627


Table V. PC loading for the 10 principal components with eigenvalues greater than 1.0 for the 219 STARLIST chemicals

Ten Most Correlated Indices

Abstract - Adequate experimental data necessary for hazard assessment is not available for the
majority  of  environmental  pollutants  and  chemicals  in  commerce.  This  has  led  to  the  increasing 
use  of  theoretical  structural  parameters  in  the  hazard  estimation  of  such  chemicals.  In  this  paper 
we  have  used  a  hierarchical  QSAR  approach  involving  topological  indices,  geometrical  (3-D) 
indices,  and  quantum  chemical  indices  to  estimate  the  mutagenicity  of  a  set  of  ninety-five 
aromatic  and  heteroaromatic  amines.  The  results  show  that  topological  indices  explain  the  major 
part  of  the  variance  in  mutagenicity.  The  addition  of  quantum  chemical  indices  to  the  set  of 
descriptors  makes  some  improvement  in  the  predictive  models. 

Keywords  -  Topological  Indices  Geometrical  Parameters  Quantum  Chemical  Parameters 
Mutagenicity  Hierarchical  QSAR 


INTRODUCTION 


The  assessment  of  the  environmental  and  human  health  hazard  posed  by  chemicals  is 
frequently  carried  out  using  insufficient  experimental  data.  This  is  true  for  industrial  chemicals,  as 
well  as  for  substances  identified  in  industrial  effluent,  hazardous  waste  sites  and  environmental 
monitoring  surveys  (Auer  et  al.  1990).  In  1984,  the  National  Research  Council  (NRC)  studied  the 
availability  of  toxicity  data  on  industrial  chemicals  and  found  that  many  of  these  chemicals  have 
very  little  or  no  test  data  (1984).  About  15  million  distinct  chemical  entities  have  been  registered 
with  the  Chemical  Abstract  Service  (CAS)  and  the  list  is  growing  by  nearly  750,000  per  year.  Out 
of these chemicals, about 1,000 enter into societal use every year (Arcos 1987). Very few of these
chemicals  have  empirical  properties  needed  for  hazard  assessment.  In  the  United  States,  the  Toxic 
Substances  Control  Act  (TSCA)  inventory  has  over  72,000  entries  and  the  list  is  growing  by 
nearly  3,000  per  year  (GAO  1993).  Of  the  some  3,000  chemicals  submitted  yearly  to  the  United 
States  Environmental  Protection  Agency  (USEPA)  for  the  premanufacture  notification  (PMN) 
process, less than 50% have any experimental data at all, less than 15% have empirical
mutagenicity  data,  and  only  about  6%  have  ecotoxicological  and  environmental  fate  data.  The 
Superfund  list  of  hazardous  substances  has  only  limited  data  for  many  of  the  over  700  chemicals 
as  well  (Auer  et  al.  1990). 

This  pervasive  lack  of  empirical  data  shows  the  real  need  for  the  development  of  methods 
which  can  estimate  environmental  and  toxic  properties  of  chemicals  using  parameters  which  can 
be  calculated  directly  from  molecular  structure.  In  recent  years  we  have  been  involved  in  the 
development  of  such  models  (Basak  and  Magnuson  1983;  Basak  1987,  1990;  Basak  et  al.  1988, 
1994; Balaban et al. 1994; Basak and Grunwald 1994a, 1994b, 1995a-1995e, 1996; Basak,
Bertelsen and others 1995; Basak, Gute and others 1995, 1996a-1996c; Basak, Grunwald and
others 1996; Basak and Gute 1996). Specifically, we have used graph theoretic indices,
geometrical (3-D) parameters, and semi-empirical quantum chemical indices in the development
of quantitative structure-activity relationship (QSAR) models pertinent to biomedicinal chemistry
and toxicology. In this paper we have used a hierarchical approach in the development of QSARs
for a group of ninety-five aromatic and heteroaromatic amines using topological indices, 3-D
parameters and a set of quantum chemical descriptors.

The purpose in using a hierarchical approach is to begin to look at the importance of the
contribution of different classes of parameters to modeling physicochemical or biologically
relevant properties. To this end we ask the question, what non-empirical molecular information is
adequate for the estimation of mutagenic potency? Is specific chemical or quantum chemical
information necessary or do simple structural descriptors do an adequate job? These questions
should lead us to a deeper understanding of the principles and molecular basis for determining
mutagenic potency.


THEORETICAL  METHODS 


Database 

A set of 95 aromatic and heteroaromatic amines, previously collected from the literature
by Debnath et al. (1992), was used to study mutagenic potency. The mutagenic activities of
these compounds in S. typhimurium TA98 + S9 microsomal preparation are expressed as the
mutation rate, ln(R), in natural logarithm (revertants/nanomole). Table I lists the compounds used
in this study and their experimentally measured mutation rates.


Computation of Topological Indices

Topological indices used in this study have been calculated by POLLY 2.3 (Basak et al.
1988), which can calculate a total of 102 indices. These indices include the Wiener index (Wiener
1947), connectivity indices (Kier and Hall 1986; Randic 1975), information theoretic indices
defined on distance matrices of graphs (Raychaudhury et al. 1984; Bonchev and Trinajstic 1977),
a set of parameters derived on the neighborhood complexity of vertices in hydrogen-filled
molecular graphs (Basak 1987; Basak and Magnuson 1983; Basak et al. 1980; Roy et al. 1984),
as well as Balaban's J indices (Balaban 1982, 1983, 1986). Table II provides brief definitions for
the topological indices included in this study.

Computation  of  Geometrical  Indices 

Van der Waals volume, V_W (Bondi 1964; Moriguchi et al. 1975; Moriguchi and Kanada
1977), was calculated using Sybyl 6.2 from Tripos Associates, Inc. (1994). The 3-D Wiener
numbers (Bogdanov et al. 1989) were calculated by Sybyl using an SPL (Sybyl Programming
Language) program developed in our laboratory. Calculation of a 3-D Wiener number consists of
summing the entries in the upper triangular submatrix of the topographic Euclidean distance matrix
for a molecule. The 3-D coordinates for the atoms were determined using CONCORD 3.2.1 (Tripos
1993). Two variants of the 3-D Wiener number were calculated: 3DW_H and 3DW. For 3DW_H,
hydrogen atoms are included in the computations, and for 3DW hydrogen atoms are excluded from
the computations.
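
The 3-D Wiener number described above can be sketched in a few lines. This is only an illustration of the stated definition (sum of the upper-triangular entries of the Euclidean distance matrix), not the SPL program used in this work; the function names and the `(element, (x, y, z))` input layout are assumptions made for the example.

```python
from itertools import combinations
from math import dist


def wiener_3d(coords):
    """Sum of Euclidean distances over all unordered pairs of atoms.

    Equivalent to summing the upper triangle of the distance matrix.
    """
    return sum(dist(a, b) for a, b in combinations(coords, 2))


def wiener_3d_pair(atoms):
    """Return (3DW, 3DW_H) for a list of (element, (x, y, z)) tuples.

    3DW   : hydrogen atoms excluded (hydrogen-suppressed)
    3DW_H : hydrogen atoms included (hydrogen-filled)
    The input layout is a hypothetical convenience for this sketch.
    """
    heavy = [xyz for element, xyz in atoms if element != "H"]
    everything = [xyz for _, xyz in atoms]
    return wiener_3d(heavy), wiener_3d(everything)
```

With real molecules the coordinates would come from a 3-D structure generator; any consistent length unit works because the index is a plain sum of distances.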

Computation  of  Quantum  Chemical  Parameters 

The quantum chemical parameters EHOMO, EHOMO1, ELUMO, ELUMO1, ΔHf, and μ were
calculated for all of the following semi-empirical Hamiltonians: AM1, PM3, MNDO, MINDO/3.
These parameters were calculated by MOPAC 6.00 in the SYBYL interface (Stewart 1985). One
difficulty was encountered in using the MINDO/3 Hamiltonian. This particular interface does not
include the information necessary for handling bromine, present in three of the ninety-five
molecules. To avoid omitting any compounds from one of the models, we accounted for the
bromine by substituting dummy atoms which were assigned the Gasteiger-Huckel charges
calculated for the original bromine atoms. These molecules containing the dummy atoms with
assigned charges were then entered into MOPAC for calculation.

Data  Reduction 


Initially, all TIs were transformed by the natural logarithm of the index plus one. This was
done since the scale of some indices may be several orders of magnitude greater than that of other
indices and other indices may equal zero. The geometric indices were transformed by the natural
logarithm of the index for consistency; the addition of one was unnecessary.
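
A minimal sketch of the two transformations just described is given here. The function names are ours and purely illustrative; the only content taken from the text is ln(x + 1) for topological indices and ln(x) for geometric indices.

```python
from math import log


def transform_topological(values):
    """ln(x + 1): tames large scale differences and stays defined at x = 0."""
    return [log(v + 1.0) for v in values]


def transform_geometric(values):
    """ln(x): geometric indices are strictly positive, so no offset is needed."""
    return [log(v) for v in values]
```

An index equal to zero maps to zero under the first transformation, which is why the offset of one is used there.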

The set of 91 TIs was partitioned into two distinct sets: topostructural indices and
topochemical indices. Topostructural indices are indices which encode information about the
adjacency and distances of atoms (vertices) in molecular structures (graphs) irrespective of the
chemical nature of the atoms involved in the bonding or factors like hybridization states of atoms
and number of core/valence electrons in individual atoms. Topochemical indices are parameters
which quantify information regarding the topology (connectivity of atoms) as well as specific
chemical properties of the atoms comprising a molecule. Topochemical indices are derived from
weighted molecular graphs where each vertex (atom) is properly weighted with selected
chemical/physical properties. These sets of the indices are shown in Table III.

According to Topliss and Edwards (1979), in conducting QSAR studies it is important to bear in
mind that the indiscriminate use of too many independent variables can lead to spurious (chance)
correlations. Using their findings, we have determined that for a set of 95 compounds no
more than 60 independent variables can be used in generating regression analyses with explained
variance (R²) of 0.7 or greater. It must be kept in mind that this is the total number of variables
initially used in modeling, not the final number of variables used in the model. This number of
independent variables should keep the probability of chance correlations below the 0.01 level.

To  reduce  the  number  of  independent  variables  that  we  would  use  for  model  construction, 
the  sets  of  topostructural  and  topochemical  indices  were  further  divided  into  subsets,  or  clusters, 
based  on  the  correlation  matrix  using  the  SAS  procedure  VARCLUS  (SAS  1988).  The 
VARCLUS procedure divides the set of indices into disjoint clusters so that each cluster is
essentially  unidimensional. 

From each cluster we selected the index most correlated with the cluster, as well as any
indices which were poorly correlated with the cluster (r < 0.70). These indices were then used in
the modeling of mutagenic potency of aromatic and heteroaromatic amines. The variable
clustering and selection of indices were performed independently for both the topostructural and
topochemical subsets.
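
The selection rule applied to each cluster can be illustrated as follows. This sketch does not reproduce VARCLUS itself; it assumes the cluster memberships and each index's correlation with its own cluster are already available, and it compares correlations by absolute value, which is our assumption. The index names and numbers in the usage note are hypothetical.

```python
def select_indices(clusters, threshold=0.70):
    """Pick representative indices from variable clusters.

    clusters: {cluster_name: {index_name: correlation_with_own_cluster}}
    Keeps, per cluster, the most correlated index plus every index whose
    correlation with the cluster falls below `threshold`.
    """
    selected = []
    for members in clusters.values():
        best = max(members, key=lambda name: abs(members[name]))
        for name, r in members.items():
            if name == best or abs(r) < threshold:
                selected.append(name)
    return selected
```

For example, a cluster holding three indices with correlations 0.95, 0.90 and 0.40 would contribute the first (most correlated) and the third (poorly correlated), while the second would be dropped as redundant.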

Statistical Analysis and Hierarchical QSAR

Regression modeling was accomplished using the SAS procedure REG on thirteen sets of
indices. These sets were constructed as part of a hierarchical approach to QSAR model
development. The hierarchy begins with the simplest indices, the topostructural. After using the
topostructural indices to model the activity, we then proceed to add the next level of complexity,
the topochemical indices from the clustering procedure, and proceed to model the activity using
these parameters. Likewise, the indices included in the model selected from this procedure are
combined with the indices from the next level, the geometrical indices, and modeling is conducted
once again. Finally, the best model utilizing topostructural, topochemical and geometrical indices
is combined with the quantum chemical parameters and modeling is conducted. This final step was
repeated four times, each time using quantum chemical parameters from a different semi-empirical
Hamiltonian, namely, AM1, PM3, MNDO, MINDO/3. Thus quantum chemical models are
developed individually, one using the AM1 parameters, one using the MNDO parameters, one
using the PM3 parameters, and one using the MINDO/3 parameters. The regression analysis
resulted in the final selection of indices for each of the models.
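
The control flow of the hierarchical procedure can be sketched as below. This is a simplified illustration, not the SAS workflow: `fit` stands for any model-selection routine that returns the chosen descriptors and their R², and the rule of carrying a level forward only when R² improves is our assumption about how a non-improving level would be handled. All names are hypothetical.

```python
def hierarchical_models(levels, fit):
    """Build models level by level, simplest descriptor class first.

    levels: ordered list of (class_name, descriptor_names)
    fit:    callable(pool) -> (selected_names, r2)
    Returns a list of (class_name, selected_names, r2), one per level.
    """
    carried, best_r2, history = [], float("-inf"), []
    for name, descriptors in levels:
        # Pool = descriptors kept so far + the new, more complex class.
        pool = carried + [d for d in descriptors if d not in carried]
        selected, r2 = fit(pool)
        history.append((name, list(selected), r2))
        if r2 > best_r2:  # carry the richer model forward only if it helps
            carried, best_r2 = list(selected), r2
    return history
```

The design keeps the descriptor classes strictly ordered by complexity, so the gain attributable to each added class can be read directly from consecutive R² values in the returned history.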

RESULTS AND DISCUSSION


The variable clustering of topostructural and topochemical indices resulted in 8
topostructural and 13 topochemical indices being retained for model construction (see Table III).
The results for the all possible subsets regression analyses have been summarized in Table IV.
Since all sets were well under 25 parameters, all possible subsets regression was used for all
analyses.
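
For readers unfamiliar with all possible subsets regression, the idea can be sketched in plain Python. This is an illustrative least-squares implementation under our own assumptions (an intercept is always included, subsets are ranked by R², and descriptor columns are not collinear); it does not reproduce the SAS REG procedure, and the function names are hypothetical.

```python
from itertools import combinations


def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for r in reversed(range(n)):
        tail = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - tail) / m[r][r]
    return x


def r_squared(columns, y):
    """R^2 of an ordinary least-squares fit of y on columns plus intercept."""
    rows = [[1.0] + [col[i] for col in columns] for i in range(len(y))]
    k = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * yi for r, yi in zip(rows, y)) for i in range(k)]
    beta = solve(xtx, xty)
    fitted = [sum(b * v for b, v in zip(beta, r)) for r in rows]
    mean_y = sum(y) / len(y)
    ss_res = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    return 1.0 - ss_res / ss_tot


def best_subsets(data, y, max_size):
    """For each subset size, return (R^2, names) of the best-fitting subset.

    data: {descriptor_name: list of values}; every subset is fitted.
    """
    best = {}
    for size in range(1, max_size + 1):
        candidates = [
            (r_squared([data[n] for n in names], y), names)
            for names in combinations(sorted(data), size)
        ]
        best[size] = max(candidates)
    return best
```

The number of subsets grows as 2^k for k candidate descriptors, which is why an exhaustive search is practical only for small candidate sets such as those used here.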

As can be seen from Table IV, using only the topostructural class of indices resulted in a
four parameter model to estimate ln(R) with a variance explained (R²) of 72.1% and a standard
error (s) of 1.04 (equation 1). The P0 and J indices are related to the size and shape of molecular
graphs; the χpc index encodes information about the degree of branching of molecular graphs; the O
parameter is related to the degree of symmetry of graphs (Basak et al. 1987). Therefore, size,
branching, and symmetry (or complexity) of skeletal graphs corresponding to molecular structures
seem to be the predominant factors in determining mutagenic potency of the set of 95 aromatic
amines.

The second step of the hierarchical method combined the four topostructural parameters
from equation 1 with the set of thirteen topochemical parameters. The resulting model for
estimation of ln(R) included six parameters (equation 8) and had an R² of 75.2% and an s of
0.99. Thus we see that the addition of topochemical information does lead to an increase in the
explained variance, improving our model without greatly increasing the number of independent
variables. The independent variables of equation 8 quantify: a) shape and size of molecular
graphs (J, P0), b) branching (χpc), c) molecular complexity / redundancy (SIC2, SIC4), and d)
degree of cyclicity (χc). It may be mentioned that we have found a very similar set of
topostructural and topochemical parameters useful in estimating normal boiling point, octanol-water
partition coefficient (Basak, Gute and others 1996c), and vapor pressure (Basak, Gute and
others 1996d) of diverse sets of molecules.

The next step of the hierarchical method takes this topostructural + topochemical model
and adds the three geometric indices; however, this actually led to a decrease in the explained
variance. As part of model construction, it became necessary to eliminate P0 from the set of
indices when adding the hydrogen-suppressed 3-D Wiener number because of resulting problems
with variance inflation between the two parameters. As a result, the model which retained the
geometric parameter (equation 9) had a slightly lower R² and a slightly higher s than the model
using topostructural and topochemical indices only. This being the case, we chose to use the
parameters from equation 8 in the following modeling with the quantum chemical parameters.
Thus, the last four models were all constructed with the six parameters from equation 8 and all
six quantum chemical parameters for the particular Hamiltonian methodology available for modeling.

As can be seen from Table IV, the AM1 parameters made the most significant contribution
to our hierarchical modeling procedure (R² = 79.1%, s = 0.92). The other three methods showed
only minimal improvement over the topostructural + topochemical model.

Finally, individual models using only topochemical, only geometrical, and only quantum
chemical parameters were constructed to further our understanding of the individual contribution
of these different types of parameters. The topochemical model was the strongest of the three,
with the geometrical and quantum chemical models showing little effectiveness. The topochemical
model included six parameters and showed a slight increase in explained variance and a slight
decrease in standard error relative to the topostructural model.

The goal of the paper was to investigate the relative effectiveness of theoretical structural
parameters, namely topostructural, topochemical, geometrical and quantum chemical parameters,
in predicting the mutagenicity of a set of aromatic and heteroaromatic amines. To this end, we
used a hierarchical approach in the development of QSARs using four classes of molecular
descriptors.

The results show that the topostructural parameters explain a large fraction of the variance
(R²) in the mutagenic potency of the amines. The best model in this area explained about 72% of
variance in mutagenicity using O, χpc, P0, and J. These indices do not contain any explicit chemical
information about the molecules. The large explained variance probably indicates that general
structural features like size, shape, symmetry, and branching play a major role in determining
mutagenic potency. The addition of topochemical variables made some improvement in the
explained variance. The best model using topostructural and topochemical indices explained about
75% of variance in mutagenicity. The addition of geometrical parameters, however, did not make
any improvement in estimation. Finally, the addition of quantum chemical parameters was
attempted. Indices from AM1, PM3, MNDO and MINDO/3 were used separately in developing
the QSAR models. While addition of the heat of formation, dipole moment and EHOMO1
parameters calculated by the AM1 method provided some improvement in the estimation of ln(R),
parameters calculated by PM3, MINDO/3 and MNDO did not make any significant improvement
in the estimation of mutagenic potency. The calculated values for the parameters used in the
hierarchical model which included the AM1 parameters (equation 10) are presented in Table V.
These values represent the original, non-transformed values for all indices used in equation 10.

Using the same set of aromatic amines, Debnath et al. (1992) developed various QSAR
models using hydrophobicity (log P, octanol/water), EHOMO and ELUMO calculated by the AM1
Hamiltonian and some indicator variables. For the largest subset (n = 88), they derived the
following model:

ln(R) = 7.20 + 1.08(log P) + 1.28(EHOMO) - 0.73(ELUMO) + 1.46(I_L)

s = 0.860, F = 12.6, R² = 0.806
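
As a worked illustration, the model quoted above can be evaluated directly. The coefficients are those given in the text; the function name, argument names, and the input values in the usage note are hypothetical, and the meaning of the indicator variable (written I_L here) is defined by Debnath et al. and not restated in this paper.

```python
def debnath_ln_r(log_p, e_homo, e_lumo, indicator):
    """Estimate ln(R) from the four-descriptor model quoted in the text.

    log_p     : octanol/water log P
    e_homo    : HOMO energy (AM1)
    e_lumo    : LUMO energy (AM1)
    indicator : 0 or 1, the indicator variable of the quoted model
    """
    return (7.20 + 1.08 * log_p + 1.28 * e_homo
            - 0.73 * e_lumo + 1.46 * indicator)
```

For made-up inputs log P = 2.0, EHOMO = -8.0, ELUMO = -0.5 and an indicator of 1, the terms are 7.20 + 2.16 - 10.24 + 0.365 + 1.46, giving an estimate of 0.945.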

The  model  in  equation  10  is  comparable  to  the  model  developed  by  Debnath  et  al.  and 
uses  all  the  95  aromatic  amines  as  compared  to  a  smaller  subset  (n=88)  used  in  their  study. 

LEGEND FOR FIGURE:

Figure 1. Scatterplot for observed ln(R) vs. estimated ln(R) using equation 10 for a set of
95 aromatic and heteroaromatic amines.


Table I. Observed and estimated mutagenic potency [ln(revertants/nmol)] for ninety-five
aromatic and heteroaromatic amines.

No.   Compound   Exp. ln(R)   Est. ln(R) (eq. 10)
1   2-bromo-7-aminofluorene   2.62   1.10
2   2-methoxy-5-methylaniline (p-cresidine)   -2.05   -3.13
3   5-aminoquinoline   -2.00   -2.30
4   4-ethoxyaniline (p-phenetidine)   -2.30   -3J6
5   1-aminonaphthalene   -0.60   -0.32
6   4-aminofluorene   1.13   0.44
7   2-aminoanthracene   2.62   1.61
8   7-aminofluoranthene   2.88   2.54
9   8-aminoquinoline   -1.14   -1.66
10   1,7-diaminophenazine   0.75   1.36
11   2-aminonaphthalene   -0.67   -0.80
12   4-aminopyrene   3.16   3.10
13   3-amino-3'-nitrobiphenyl   -0.55   -0.19
14   2,4,5-trimethylaniline   -1.32   -0.74
15   3-aminofluorene   0.89   0.74
16   3,3'-dichlorobenzidine   0.81   0.24
17   2,4-dimethylaniline (2,4-xylidine)   -2.22   -1.63
18   2,7-diaminofluorene   0.48   0.97
19   3-aminofluoranthene   3.31   2.57
20   2-aminofluorene   1.93   1.08
21   2-amino-4'-nitrobiphenyl   -0.62   0.37
22   4-aminobiphenyl   -0.14   0.06
23   3-methoxy-4-methylaniline (o-cresidine)   -1.96   -3.27
24   2-aminocarbazole   0.60   0.60
25   2-amino-5-nitrophenol   -2.52   -2.01
26   2,2'-diaminobiphenyl   -1.52   -1.24
27   2-hydroxy-7-aminofluorene   0.41   1.61
28   1-aminophenanthrene   2.38   1.80
29   2,5-dimethylaniline (2,5-xylidine)   -2.40   -1.55
30   4-amino-2-nitrobiphenyl   -0.92   -0.50
31   2-amino-4-methylphenol   -2.10   -2.43
32   2-aminophenazine   0.55   1.32
33   4-aminophenylsulfide   0.31   -0.47
34   2,4-dinitroaniline   -2.00   -0.75
35   2,4-diaminoisopropylbenzene   -3.00   -3.36
36   2,4-difluoroaniline   -2.70   -1.29
37   4,4'-methylenedianiline   -1.60   -0.97
38   3,3'-dimethylbenzidine   0.01   -0.23
39   2-aminofluoranthene   3.23   2.66
40   2-amino-3'-nitrobiphenyl   -0.89   -0.42
41   1-aminofluoranthene   3.35   2.23
42   4,4'-ethylenebis (aniline)   -2.15   -0.92
43   4-chloroaniline   -2.52   -2.94
44   2-aminophenanthrene   2.46   1.96
45   4-fluoroaniline   -3.32   -2.57
46   9-aminophenanthrene   2.98   1.13
47   3,3'-diaminobiphenyl   -1.30   -0.20
48   2-aminopyrene   3.50   2.58
49   2,6-dichloro-1,4-phenylenediamine   -0.69   -1.46
50   2-amino-7-acetamidofluorene   1.18   0.89
51   2,8-diaminophenazine   1.12   1J5
52   6-aminoquinoline   -2.67   -2.31
53   4-methoxy-2-methylaniline (m-cresidine)   -3.00   -2.44
54   3-amino-2'-nitrobiphenyl   -1.30   -0.90
55   2,4'-diaminobiphenyl   -0.92   -0.40
56   1,6-diaminophenazine   0.20   0.20
57   4-aminophenyldisulfide   -1.03   -1.00
58   2-bromo-4,6-dinitroaniline   -0.54   -1.25
59   2,4-diamino-n-butylbenzene   -2.70   -3.72
60   4-aminophenylether   -1.14   -0.76
61   2-aminobiphenyl   -1.49   -0.77
62   1,9-diaminophenazine   0.04   0.09
63   1-aminofluorene   0.43   0.28
64   8-aminofluoranthene   3.80   2.69
65   2-chloroaniline   -3.00   -2.37
66   2-amino-α,α,α-trifluorotoluene   -0.80   -1.63
67   2-amino-1-nitronaphthalene   -1.17   -0.90
68   3-amino-4'-nitrobiphenyl   0.69   0.14
69   4-bromoaniline   -2.70   -3.08
70   2-amino-4-chlorophenol   -3.00   -2.39
71   3,3'-dimethoxybenzidine   0.15   0.05
72   4-cyclohexylaniline   -1.24   -0.73
73   4-phenoxyaniline   0.38   -0.50
74   4,4'-methylenebis (o-ethylaniline)   -0.99   -0.51
75   2-amino-7-nitrofluorene   3.00   1.19
76   benzidine   -0.39   -0.52
77   1-amino-4-nitronaphthalene   -1.77   -0.95
78   4-amino-3'-nitrobiphenyl   1.02   0.47
79   4-amino-4'-nitrobiphenyl   1.04   0.73
80   1-aminophenazine   -0.01   1.28
81   4,4'-methylenebis (o-fluoroaniline)   0.23   0.41
82   4-chloro-2-nitroaniline   -2.22   -2.06
83   3-aminoquinoline   -3.14   -2.22
84   3-aminocarbazole   -0.48   0.60
85   4-chloro-1,2-phenylenediamine   -0.49   -2.01
86   3-aminophenanthrene   3.77   1.79
87   3,4'-diaminobiphenyl   0.20   -0.34
88   1-aminoanthracene   1.18   1.86
89   1-aminocarbazole   -1.04   0.65
90   9-aminoanthracene   0.87   1.15
91   4-aminocarbazole   -1.42   0.38
92   6-aminochrysene   1.83   3.41
93   1-aminopyrene   1.43   3.51
94   4,4'-methylenebis (o-isopropylaniline)   -1.77   -1.13
95   2,7-diaminophenazine   3.97   1.93


Table II. Symbols and definitions of topological and geometrical parameters.

I_D^W        Information index for the magnitudes of distances between all possible pairs of vertices of a graph
mean I_D^W   Mean information index for the magnitude of distance
W            Wiener index = half-sum of the off-diagonal elements of the distance matrix of a graph
I^D          Degree complexity
H^V          Graph vertex complexity
H^D          Graph distance complexity
mean IC      Information content of the distance matrix partitioned by frequency of occurrences of distance h
I_ORB        Information content or complexity of the hydrogen-suppressed graph at its maximum neighborhood of vertices
O            Order of neighborhood when IC_r reaches its maximum value for the hydrogen-filled graph
M1           A Zagreb group parameter = sum of square of degree over all vertices
M2           A Zagreb group parameter = sum of cross-product of degrees over all neighboring (connected) vertices
IC_r         Mean information content or complexity of a graph based on the r-th (r = 0-6) order neighborhood of vertices in a hydrogen-filled graph
SIC_r        Structural information content for r-th (r = 0-6) order neighborhood of vertices in a hydrogen-filled graph
CIC_r        Complementary information content for r-th (r = 0-6) order neighborhood of vertices in a hydrogen-filled graph
hχ           Path connectivity index of order h = 0-6
hχ_C         Cluster connectivity index of order h = 3-5
hχ_PC        Path-cluster connectivity index of order h = 4-6
hχ_Ch        Chain connectivity index of order h = 5, 6
hχ^b         Bond path connectivity index of order h = 0-6
hχ^b_C       Bond cluster connectivity index of order h = 3, 5
hχ^b_Ch      Bond chain connectivity index of order h = 5, 6
hχ^b_PC      Bond path-cluster connectivity index of order h = 4-6
hχ^v         Valence path connectivity index of order h = 0-6
hχ^v_C       Valence cluster connectivity index of order h = 3, 5
hχ^v_Ch      Valence chain connectivity index of order h = 5, 6
hχ^v_PC      Valence path-cluster connectivity index of order h = 4-6
P_h          Number of paths of length h = 0-10
J            Balaban's J index based on distance
J^B          Balaban's J index based on bond types
J^X          Balaban's J index based on relative electronegativities
J^Y          Balaban's J index based on relative covalent radii
V_W          Van der Waals volume
3DW          3-D Wiener number for the hydrogen-suppressed geometric distance matrix
3DW_H        3-D Wiener number for the hydrogen-filled geometric distance matrix


Table III. Classification of parameters used in developing models for mutagenic potency (ln R).

Quantum  Chemical: 
AMI,  PM3,  MNDO, 


Topological 

Topochemical 

Geometric 

MINDO/3 

I<»B 

Vw 

Ehomo 

ICo-ICe 

3Dw 

Ehomoi 

w 

SICo-SICe 

Elumo 

f 

aCo-CIQ 

Elumoi 

Y-Y 

AHT 

Ycand  Yc 

IC 

Ych  and  Yo. 

o 

Ypc-Yrc 

Ml 

Y-Y 

Mi 

VcandYc 

Ych  and  Yo. 

^Xcand^Xc 

Ypc-Ypc 

^chand^Xch 

J® 

^Xpc-*Xpc 

Po  ■  P 10 

J 


Table IV. Summary of regression results for all classes of parameters.

Eqn. | Parameter class | Variables included | F | R² | s
1 | topostructural | O, χpc, P0, J | 58.1 | 0.721 | 1.04
2 | topochemical | IC4, SIC2, SIC4, V., Vc, Vrc | 41.1 | 0.737 | 1.02
3 | geometric |  | 61.8 | 0.399 | 1.50
4 | QC: AM1 | EHOMO1, ELUMO, μ | 31.8 | 0.512 | 1.37
5 | QC: MNDO | EHOMO1, ELUMO | 54.7 | 0.543 | 1.31
6 | QC: MINDO/3 | EHOMO, ELUMO | 32.4 | 0.517 | 1.36
7 | QC: PM3 | EHOMO, EHOMO1, ELUMO | 30.0 | 0.497 | 1.39
8 | topostructural + topochemical | χpc, P0, J, SIC2, SIC4, χc | 44.5 | 0.752 | 0.99
9 | topostructural + topochemical + geometric | χpc, J, SIC2, SIC4, χc, 3DW | 42.9 | 0.746 | 1.00
10 | topostructural + topochemical + geometric + AM1 | χpc, P0, J, SIC2, SIC4, χc, EHOMO1, ΔHf, μ | 35.8 | 0.791 | 0.92
11 | topostructural + topochemical + geometric + MNDO | χpc, P0, J, SIC2, SIC4, χc, ΔHf | 40.4 | 0.765 | 0.97
12 | topostructural + topochemical + geometric + MINDO/3 | χpc, P0, J, SIC2, SIC4, Eiojmo | 45.8 | 0.758 | 0.98
13 | topostructural + topochemical + geometric + PM3 | χpc, P0, J, SIC2, SIC4, χc, AE^ | 39.7 | 0.761 | 0.98


Table V. Calculated values for the topostructural, topochemical, and AM1 quantum chemical
parameters used in equation 10.

No.   χpc   P0   J   SIC2   SIC4   χc   EHOMO1   ΔHf   μ
1   2.482   15   1.722   0.780   0.966   0.080   -9.510998   57.462489   3.246
2   1.409   10   2.356   0.824   0.875   0.059   -9.198889   -24.061979   1.613
3   1.440   11   1.993   0.831   0.975   0.058   -9.528133   51.959364   2.993
4   0.841   10   2.132   0.775   0.818   0.000   -9.761040   -22.045505   1.782
5   1.440   11   1.993   0.639   0.931   0.058   -9.342732   40.325881   1.549
6   2.209   14   1.800   0.697   0.931   0.109   -9.019172   53.561923   1.377
7   2.148   15   1.673   0.613   0.885   0.049   -8.752501   61.467301   1.686
8   3.051   17   1.694   0.616   0.890   0.119   -8.883560   90.631004

Abstract. If the organic carbon in sediment or soil has the same partitioning properties as
octanol, the relationship between the organic-carbon normalized sorption coefficient, K_oc, with
units of L/kg, and the octanol-water partition coefficient, K_ow, is log K_oc = log K_ow + 0.214 at
20 °C. Observations are well represented by log K_oc = log K_ow - 0.289; this is calculated using
the data critically reviewed by Baker and coworkers (Water Environ. Res., 62(2), p136-145
(1997)). We conclude that partitioning properties of the organic carbon in sediment or soil are
not the same as octanol; experimental values of K_oc are about one third of those expected if the
organic carbon behaves identically to octanol.
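
The "about one third" figure follows directly from the two intercepts quoted in the abstract, as the short check below shows. The function name is ours; the two offsets are the values stated above.

```python
def observed_to_expected_ratio(observed_offset=-0.289, octanol_offset=0.214):
    """Ratio of observed K_oc to the K_oc expected for octanol-like carbon.

    Both relationships share the slope of 1 in log K_ow, so the ratio is
    10 raised to the difference of the intercepts, independent of K_ow.
    """
    return 10.0 ** (observed_offset - octanol_offset)
```

The intercept difference is -0.503 log units, which corresponds to a ratio of roughly 0.31.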


When considering the distribution of hydrophobic nonionic organic compounds in aqueous
systems, are the partitioning properties of octanol the same as those of the organic carbon in soils
or sediments? If so, we may be tempted to suppose that

K_oc = K_ow  [1]

where K_oc is the sediment, or soil, sorption coefficient and K_ow is the octanol-water partition
coefficient. These coefficients are defined as follows:

K_oc = C_s / (f_oc C_w),  K_ow = C_o / C_w  [2]

where C_s is the mass of chemical per unit mass of dry sediment or soil, f_oc is the fraction of
organic carbon in the dry sediment or soil, C_w is the mass of chemical per unit volume of
aqueous phase, and C_o is the mass of chemical per unit volume of octanol. There is ample
evidence to believe that eq 1 holds within an order of magnitude or so. Because the partitioning
of nonionic organic chemicals is a purely physical process, we may be tempted to believe that
the organic carbon in the sediment or soil is no different from that in octanol and so it is
reasonable to suppose that eq 1 holds.

Our purpose is to put eq 1 on a more formal footing and examine closely the ideas behind it.
We do not like the equation as it stands unless we clearly recognize it to be a dimensional
equation (units of $K_{oc}$ are L/kg here and throughout); the concentration basis on the left-hand
side is not the same as that on the right-hand side. The concentration of the chemical in the
sediment, $C_{s}$, is defined as the mass of chemical per unit mass of dry sediment or soil and the
concentration of the chemical in the octanol is defined as the mass of chemical per unit volume of
octanol. However, we may develop a relationship between $K_{oc}$ and $K_{ow}$, ensuring a consistent
concentration basis thereby.

To do this, we take the idea that the organic carbon in the sediment is no different from that
in the octanol and construct a model for which we consider the equilibrium. The model consists
of sediment, water and octanol in a container (see figure 1a) and we consider the distribution of a
chemical at equilibrium within the container. The octanol, being the least dense phase, is
uppermost. Now we assume that the distribution behavior is explained by considering the
sediment to behave like octanol. Practically, we would have to prepare the system in the
configuration shown in figure 1b because octanol is less dense than water.

At equilibrium

$C_{aq} = C_{oct}^{o} / K_{ow}^{o} = C_{oct}^{s} / K_{ow}^{s}$

or

$K_{ow} = C_{oct}^{o} / C_{aq} = C_{oct}^{s} / C_{aq}$  [3]

where the superscripts "o" and "s" represent the octanol and the octanol phase representing the
sediment, respectively. Now we consider the concentration of the chemical in the octanol phase
that represents the sediment.

$C_{oct}^{s} = m / V_{oct}$  [4]

where $m$ is the mass of chemical in the volume of octanol, $V_{oct}$. Now we express this
concentration in terms of the mass of chemical per unit mass of octanol.

$C_{s}^{s} = m / M_{oct} = m / (\rho_{oct} \cdot V_{oct})$  [5]

where $M_{oct}$ is the mass of octanol and $\rho_{oct}$ is the density of octanol. This can be written as

$C_{oct}^{s} = C_{s}^{s} \cdot \rho_{oct} = (C_{s}^{s} / F_{oc}^{s}) \cdot F_{oc}^{oct} \cdot \rho_{oct}$

where $F_{oc}^{oct}$ is the fraction of organic carbon in octanol. Now, from eq 3,

$K_{ow} = C_{oct}^{s} / C_{aq} = [C_{s}^{s} / (F_{oc}^{s} \cdot C_{aq})] \cdot F_{oc}^{oct} \cdot \rho_{oct}$

The concentration term within the square brackets on the right-hand side is just the organic-carbon
normalized sorption coefficient for the octanol phase representing the sediment. So

$K_{ow} = K_{oc}^{s} \cdot F_{oc}^{oct} \cdot \rho_{oct}$

The units are now consistent. In the logarithmic form

$\log K_{oc} = \log K_{ow} - \log(F_{oc} \cdot \rho_{oct})$  [6]

Here we have dropped the superscripts because the two phases are indistinguishable. The
relationship is weakly dependent upon temperature because of the density term. We calculate
the fraction of organic carbon in octanol from the relative molecular masses; $F_{oc} = 0.738$; the
density of octanol (1) at 20 °C is 0.827 g/mL. So, at 20 °C,

$\log K_{oc} = \log K_{ow} + 0.214$  [7]
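
As a check on the arithmetic, the constant in eq 7 can be reproduced from the molecular formula of octanol (C8H18O) and the density quoted above. The Python sketch below is illustrative only: the atomic masses are standard values that are not given in the text, and the function names are our own. The constant is evaluated with the carbon fraction rounded to three figures, as quoted in the text.

```python
import math

# Standard atomic masses in g/mol (assumed values, not taken from the text).
ATOMIC_MASS = {"C": 12.011, "H": 1.008, "O": 15.999}

def carbon_fraction_octanol():
    """Mass fraction of carbon in 1-octanol, C8H18O."""
    carbon = 8 * ATOMIC_MASS["C"]
    total = carbon + 18 * ATOMIC_MASS["H"] + ATOMIC_MASS["O"]
    return carbon / total

def log_offset(f_oc, density_kg_per_l):
    """Constant in log Koc = log Kow + constant, i.e. -log10(F_oc * rho_oct)."""
    return -math.log10(f_oc * density_kg_per_l)

f_oc = carbon_fraction_octanol()
# 0.827 g/mL is numerically equal to 0.827 kg/L, matching Koc in L/kg.
offset = log_offset(round(f_oc, 3), 0.827)
print(f"F_oc = {f_oc:.3f}, constant in eq 7 = {offset:.3f}")
```

Expressing the density in kg/L keeps the product of carbon fraction and density on the same basis as the L/kg units of the organic-carbon normalized sorption coefficient.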

If the organic carbon in the sediment behaves identically to octanol, then we expect the
relationships in eq 6 and 7 to be observed. What is observed? From experimental data, many
workers have developed relationships of the general form

$\log K_{oc} = a \cdot \log K_{ow} + b$  [8]

where values of $a$ and $b$ are determined by linear regression. To make the essential point here,
we consider only the recent relationship developed by Baker and coworkers (2); they developed
selection criteria and critically reviewed the available measurements. They found the following
values, for $1.7 < \log K_{ow} < 7.0$, using data for 72 chemicals:

$a = 0.903 \pm 0.034$; $b = 0.094 \pm 0.142$; and $r^2 = 0.91$

We wish to compare eq 7 with this result.

However, the equilibrium model that we used requires $a = 1$; other values of $a$ do not
have a physical meaning for the model considered here. Using the data of Baker and coworkers,
we applied a regression model in which $a$ is forced to be unity (3). We obtain

for $a = 1$: $b = -0.289$; and $r^2 = 0.90$
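
When the slope is fixed at unity, the least-squares intercept reduces to the mean of the differences between the logarithm of the sorption coefficient and the logarithm of the octanol-water partition coefficient. The Python sketch below illustrates this under stated assumptions: the data points are hypothetical values chosen for illustration, not the Baker data set, and the coefficient of determination is computed as one minus the ratio of residual to total sum of squares, which may differ from the definition used in the original regression.

```python
from statistics import mean

def fit_unit_slope(log_kow, log_koc):
    """Least-squares intercept b for log_koc = log_kow + b (slope fixed at 1).

    Returns (b, r2), where r2 = 1 - SS_res / SS_tot for the constrained model.
    """
    # With the slope fixed, minimizing the squared error gives b = mean(y - x).
    b = mean(y - x for x, y in zip(log_kow, log_koc))
    y_bar = mean(log_koc)
    ss_res = sum((y - (x + b)) ** 2 for x, y in zip(log_kow, log_koc))
    ss_tot = sum((y - y_bar) ** 2 for y in log_koc)
    return b, 1.0 - ss_res / ss_tot

# Hypothetical illustrative values, NOT the measurements reviewed by Baker et al.
log_kow = [2.0, 3.0, 4.0, 5.0, 6.0]
log_koc = [1.8, 2.6, 3.7, 4.8, 5.6]

b, r2 = fit_unit_slope(log_kow, log_koc)
print(f"b = {b:.3f}, r2 = {r2:.3f}")
```

Fixing the slope leaves a single fitted parameter, so the intercept can be compared directly with the constant predicted by the equilibrium model.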

If octanol and the organic carbon in the sediment or soil are indistinguishable, then we expect
$b = +0.214$ (see eq 7). In other words, experimental values of $K_{oc}$ are about one third of the
values expected if the sediment or soil organic carbon were to have the same partitioning
properties as octanol. Whereas eq 1 is a useful first approximation, it should not be concluded
therefrom that the partitioning properties of octanol and the organic carbon in sediment or soils
are the same.
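
The "one third" figure follows from the difference between the two intercepts at unit slope: a shift of the intercept on the logarithmic scale corresponds to a constant multiplicative factor in the sorption coefficient. A minimal Python sketch of that arithmetic, using the two intercepts quoted in the text (the function name is our own):

```python
def koc_ratio(b_observed, b_expected):
    """Ratio of observed to expected Koc implied by two intercepts at unit slope."""
    return 10 ** (b_observed - b_expected)

# Intercepts from the text: -0.289 (forced-slope regression) and +0.214 (eq 7).
ratio = koc_ratio(-0.289, 0.214)
print(f"observed/expected Koc = {ratio:.2f}")
```

The ratio comes out at roughly 0.31, which is the basis for describing the observed values as about one third of those expected.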

Figure. Model used to examine the relationship between $K_{oc}$ and $K_{ow}$.
In A, octanol is placed on an aqueous phase that lies over the soil or sediment. Upon assuming
that the sediment behaves like octanol, the system would adopt the configuration shown in B.
The model considers the distribution of the chemical between the phases at equilibrium; the
symbols, representing the concentrations of chemical in the various phases, are defined in the
text.